Automated Syllabus of Natural Language Processing Papers

Built by Rex W. Douglass (@RexDouglass; GitHub; LinkedIn)

Papers curated by hand, summaries and taxonomy written by LLMs.

Submit a paper to add for review

Introduction

Overview Of Natural Language Processing

  • Consider scaling up n-gram language models to match the data scale used in neural large language models, allowing for unbounded n, and utilizing a suffix-array-based engine for efficient computation (see the sketch at the end of this subsection). (J. Liu et al. 2024)

  • Focus on identifying the critical data size in language models, which marks the phase transition from quick memorization to slow generalization, and study the impact of different data regimes on model performance. (Zhu et al. 2024)
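
For a concrete sense of the suffix-array idea recommended above (J. Liu et al. 2024), here is a toy, hedged sketch: the corpus is indexed as a sorted list of suffixes, counts come from binary search, and the next-token probability backs off to the longest context that still occurs. The tiny character-level corpus and the naive index are simplifying assumptions, not the paper's engine.

```python
# Toy "unbounded-n" (infinity-gram) estimate backed by a naive suffix array.
# A real engine indexes token ids with a compact suffix array; here we sort raw
# character suffixes of a tiny corpus purely to show the counting logic.
from bisect import bisect_left

corpus = "the cat sat on the mat . the cat ran away . "
suffixes = sorted(corpus[i:] for i in range(len(corpus)))  # naive suffix array

def count(pattern: str) -> int:
    """Occurrences of `pattern` in the corpus via two binary searches."""
    lo = bisect_left(suffixes, pattern)
    hi = bisect_left(suffixes, pattern + "\uffff")  # sentinel above any corpus char
    return hi - lo

def next_token_prob(context: str, token: str) -> float:
    """Back off to the longest suffix of `context` that occurs, then use raw counts."""
    while context and count(context) == 0:
        context = context[1:]
    denom = count(context) if context else len(corpus)
    return count(context + token) / denom if denom else 0.0

print(next_token_prob("the cat ", "s"))  # 0.5: one of two "the cat " contexts continues with "s"
```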

Importance And Applications

  • Conduct surveys to accurately gauge both the beliefs and the sociological beliefs (beliefs about what others believe) within your research communities, allowing for improved communication and reduced misunderstandings. (Michael et al. 2022)

Challenges And Opportunities

  • Be aware of the potential for large language models like ChatGPT and GPT-4 to memorize certain books, leading to biased results in downstream tasks, and therefore advocate for transparency in training data to ensure accurate evaluations. (Chang et al. 2023)

History And Development Of Nlp

  • Carefully consider and control for various linguistic and psycholinguistic attributes when selecting word sets for psycholinguistic experiments using the MRC machine-usable dictionary, which contains 150,837 words with up to 26 attributes for each. (NA?)

Early Developments

  • Focus on identifying the most important keyword in the input message, establishing a minimal context around it, selecting an appropriate transformation rule, generating intelligent responses without keywords, and providing efficient editing capabilities for the script. (Weizenbaum 1966)
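
A minimal, hypothetical sketch of the keyword-and-transformation-rule loop described above (Weizenbaum 1966); the ranks, regex patterns, and fallback replies are illustrative placeholders rather than the original DOCTOR script.

```python
# Tiny ELIZA-style responder: find the highest-ranked keyword in the input,
# apply a reassembly rule built around the matched context, and fall back to a
# content-free reply when no keyword applies.
import random
import re

# (rank, keyword pattern, reassembly template) -- illustrative, not the original script
RULES = [
    (10, r"\bi need (.+)", "Why do you need {0}?"),
    (5,  r"\bmy (\w+)",    "Tell me more about your {0}."),
    (1,  r"\byes\b",       "You seem quite positive."),
]
FALLBACKS = ["Please go on.", "I see.", "What does that suggest to you?"]

def respond(utterance: str) -> str:
    utterance = utterance.lower()
    best = None
    for rank, pattern, template in RULES:
        match = re.search(pattern, utterance)
        if match and (best is None or rank > best[0]):
            best = (rank, template, match.groups())
    if best is None:
        return random.choice(FALLBACKS)        # keyword-free response
    _, template, groups = best
    return template.format(*groups)

print(respond("I need a vacation"))            # -> Why do you need a vacation?
print(respond("well, my mother called"))       # -> Tell me more about your mother.
```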

Fundamentals Of Language Models

  • Consider integrating both autoencoding and autoregressive pre-training objectives into a unified framework for protein language models, as this may lead to more versatile and robust models capable of handling a wider variety of protein-related tasks. (Bo Chen et al. 2024)

  • Focus on understanding the complex interplay between generative foundation models (GFMs) and the digital commons, considering factors such as data quality, accessibility, and the potential for negative consequences such as misinformation and bias. (S. Huang and Siddarth 2023)

  • Adopt a skeptical approach towards evaluating large language models (LLMs) performance on theory-of-mind (ToM) tasks, considering outlier failure cases as crucial evidence, and avoiding hasty conclusions based solely on average success rates. (Ullman 2023)

  • Leverage the power of large language models like ChatGPT for document-level machine translation tasks, as they demonstrate superior performance over traditional commercial machine translation systems and advanced document-level machine translation methods, especially in terms of discourse modeling abilities. (Longyue Wang et al. 2023)

  • Consider the potential privacy leakage when implementing prompt-tuning language models, especially in real-world applications like email services, and develop appropriate mitigation measures. (S. Xie et al. 2023)

  • Carefully consider the ethical implications and potential risks associated with integrating artificial intelligence tools like ChatGPT into interactive learning environments, while also recognizing the benefits these tools can bring to fostering personalized, reflective, and integrated learning experiences. (Rospigliosi 2023)

  • Carefully consider the formulation of your input data when studying the social biases of language models, as different input formats can lead to varying levels of bias in the output. (Akyürek et al. 2022)

  • Avoid relying solely on participant self-reports regarding their mental processes, instead utilising a multi-paradigm approach combining statistical analysis of judgements with computational analysis of language features present in the self-descriptions to independently reconstruct the heuristics participants rely on. (Biderman and Raff 2022)

  • Consider developing a non-autoregressive language model based on continuous diffusions, called Diffusion-LM, which enables simple gradient-based algorithms to perform complex, controllable generation tasks, significantly outperforming prior work. (Kitaev and Klein 2018)

  • Utilise the “straight-through Gumbel-softmax estimator” technique to make the process of generating messages fully differentiable, allowing for effective backpropagation and thus facilitating the development of a communication protocol within multi-agent games (see the sketch at the end of this subsection). (Bengio, Léonard, and Courville 2013)

  • Consider utilizing a hierarchical Bayesian language model based on Pitman-Yor processes for your studies, as it provides superior cross entropy results compared to interpolated Kneser-Ney and similar performance to modified Kneser-Ney, while offering the benefits of Bayesian probabilistic models. (Teh 2006)

  • Conduct multiple experiments using various methods to examine the effects of different factors on the processing of fictive motion sentences, such as travel distance, travel rate, and difficulty of terrain, in order to better understand the role of mental simulation in comprehending these types of sentences. (NA?)

  • Utilise a novel statistical model for character level language modelling, which is parameterised by a program from a domain-specific language (DSL) allowing expression of non-trivial data dependencies. This model offers similar precision to neural networks, but shares advantages with n-gram models such as faster query times and ease of adding or removing training data samples. Furthermore, the model is interpretable and updatable through manual inspection of its underlying program. (NA?)

  • Carefully consider the potential benefits and challenges of implementing large language models in education, focusing on developing appropriate competencies among teachers and learners, adopting a clear pedagogical approach centered around critical thinking and fact-checking, and addressing issues like bias, human oversight, and misuse responsibly. (NA?)
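
Returning to the straight-through Gumbel-softmax recommendation above (the bullet citing Bengio, Léonard, and Courville 2013), here is a minimal PyTorch sketch: the forward pass emits a discrete one-hot "message" while gradients flow through the relaxed softmax sample. The 8-symbol vocabulary and the toy speaker/listener pair are placeholder assumptions, not any particular paper's setup.

```python
# Straight-through Gumbel-softmax: hard one-hot message in the forward pass,
# gradients taken with respect to the relaxed sample in the backward pass.
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y_soft = F.softmax((logits + gumbel) / tau, dim=-1)          # relaxed sample
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)   # discrete one-hot
    return y_hard - y_soft.detach() + y_soft   # forward: hard; backward: soft

# Toy "speaker" emitting one of 8 symbols; a toy "listener" decodes the message.
speaker_logits = torch.randn(4, 8, requires_grad=True)
listener = torch.nn.Linear(8, 8)
message = gumbel_softmax_st(speaker_logits, tau=0.5)
loss = F.cross_entropy(listener(message), torch.tensor([1, 3, 5, 7]))
loss.backward()
print(message[0])                          # a one-hot vector
print(speaker_logits.grad.abs().mean())    # nonzero: gradients reached the speaker
```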

Probabilistic Models

  • Consider utilizing a hierarchical Bayesian language model based on Pitman-Yor processes, which can effectively capture the power-law distributions found in natural languages and provide superior cross entropy results compared to traditional smoothing methods like interpolated Kneser-Ney (see the predictive distribution at the end of this subsection). (Sadat and Habash 2006)

  • Explore the use of aggregate and mixed-order Markov models as alternatives to traditional n-gram models in language processing tasks, as these models can effectively bridge the gap between different order n-grams and significantly reduce the perplexity of unseen word combinations. (Saul and Pereira 1997)
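
For reference, a sketch of the predictive distribution behind the hierarchical Pitman-Yor recommendation above (Teh 2006), in the usual Chinese-restaurant notation: c_uw and t_uw are the customer and table counts for word w after context u, d and θ are the per-depth discount and strength parameters, and π(u) drops the earliest word of the context.

```latex
P(w \mid \mathbf{u}) =
  \frac{c_{\mathbf{u}w} - d_{|\mathbf{u}|}\, t_{\mathbf{u}w}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}}
  + \frac{\theta_{|\mathbf{u}|} + d_{|\mathbf{u}|}\, t_{\mathbf{u}\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot}}
  \, P\!\bigl(w \mid \pi(\mathbf{u})\bigr)
```

Capping each table count at one and setting θ to zero recovers a form of interpolated Kneser-Ney, which is why the two methods behave so similarly in practice.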

Neural Network Based Models

  • Utilize diagnostic classifiers to gain a comprehensive understanding of how neural language models handle linguistic information like subject-verb agreement, and subsequently leverage this knowledge to enhance the models' performance (see the sketch below). (Giulianelli et al. 2018)
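
A minimal sketch of the diagnostic-classifier idea: train a simple linear probe on stored hidden states to predict a linguistic label such as subject number. The random features and synthetic labels below are stand-ins for real LM activations and annotations, which you would extract from your own model and data.

```python
# Diagnostic classifier: a linear probe that reads a linguistic property
# (here, a synthetic binary "subject is plural" label) off frozen hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 256))          # stand-in for LM activations
labels = (hidden_states[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
# High probe accuracy suggests the property is linearly decodable from this layer.
```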

Transformer Models

  • Utilise the Data Selection with Importance Resampling (DSIR) framework when selecting pretraining data for language models. This involves mapping raw and target data onto a feature space, estimating importance weights within this space, and then sampling a subset of raw data based on these weights. By doing so, researchers can ensure their chosen data matches the desired target distribution, leading to improved performance in downstream tasks (see the sketch at the end of this subsection). (S. M. Xie et al. 2023)

  • Focus on developing comprehensive strategies for promoting digital language equality across multiple domains, including language resources, text analysis, speech processing, machine translation, information extraction and retrieval, natural language generation and summarization, and human-computer interaction. (NA?)
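
A rough, hedged sketch of the hashed n-gram importance-resampling step behind the DSIR recommendation above (S. M. Xie et al. 2023): fit simple bag-of-hashed-n-gram models on target and raw data, weight each raw document by the log-ratio, and resample with Gumbel noise. The bucket count, add-one smoothing, whitespace tokenization, helper names, and toy documents are all illustrative assumptions rather than the paper's implementation.

```python
# DSIR-style data selection sketch: importance weights in a hashed n-gram
# feature space, followed by Gumbel top-k sampling without replacement.
import numpy as np

NUM_BUCKETS = 4096

def hashed_ngram_counts(doc: str) -> np.ndarray:
    counts = np.zeros(NUM_BUCKETS)
    toks = doc.lower().split()
    for n in (1, 2):
        for i in range(len(toks) - n + 1):
            counts[hash(" ".join(toks[i:i + n])) % NUM_BUCKETS] += 1
    return counts

def fit_unigram(docs) -> np.ndarray:
    total = sum(hashed_ngram_counts(d) for d in docs) + 1.0   # add-one smoothing
    return total / total.sum()

def select(raw_docs, target_docs, k: int):
    p_target, p_raw = fit_unigram(target_docs), fit_unigram(raw_docs)
    log_ratio = np.log(p_target) - np.log(p_raw)
    weights = np.array([hashed_ngram_counts(d) @ log_ratio for d in raw_docs])
    gumbel = np.random.gumbel(size=len(raw_docs))               # sampling w/o replacement
    return [raw_docs[i] for i in np.argsort(-(weights + gumbel))[:k]]

raw = ["stock prices fell sharply", "the cell divides by mitosis",
       "parliament passed the bill", "enzymes catalyze reactions in the cell"]
target = ["the cell membrane and enzymes", "mitosis and cell division"]
print(select(raw, target, k=2))   # should prefer the biology-like raw documents
```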

Transformer Based Language Models

  • Consider using BayesPrompt, a method that generates discriminative prompts for large-scale pre-trained language models by approximating the debiased factual distributions of downstream domains, to improve the accuracy of few-shot inference. (J. Li et al. 2024)

  • Incorporate conceptual knowledge into pre-trained language models through a novel pre-training objective called entity concept prediction (ECP), which leverages external taxonomies to improve the model's understanding of entities and their relationships within a hierarchical structure. (Xintao Wang et al. 2024)

  • Consider developing self-improving reward models that continuously update during LLM alignment, rather than freezing them, to overcome limitations associated with the size and quality of human preference data. (W. Yuan et al. 2024)

  • Explore the behavior of smaller language models when trained with a significantly larger number of tokens than what is suggested by the scaling law (Hoffmann et al., 2022), as they could potentially demonstrate competitive performance compared to existing open-source language models of similar sizes. (P. Zhang et al. 2024)

  • Focus on collecting diverse and high-quality data, carefully curating and deduplicating it, and then utilizing advanced techniques like LoRA instruction finetuning to train more effective and efficient language models. (Anand et al. 2023)

  • Adopt a hybrid approach between traditional machine learning evaluations and psychology-style probing to better capture the unique characteristics and potential of advanced language models like GPT-4. (Bubeck et al. 2023)

  • Adopt a holistic approach towards studying the life cycle of knowledge in pre-trained language models (PLMs), considering its acquisition, maintenance, usage, and updates, rather than focusing on individual stages. (Cao et al. 2023)

  • Systematically analyze and compare the performance of open-source large language models (LLMs) against ChatGPT across various tasks and benchmarks to gain a comprehensive understanding of their relative strengths and limitations. (Hailin Chen et al. 2023)

  • Consider using Uprise, a universal prompt retrieval system, to enhance the performance of Large Language Models (LLMs) in a cross-task and cross-model scenario, allowing them to better handle unseen task types and different LLMs. (D. Cheng et al. 2023)

  • Consider using Black-Box Prompt Optimization (BPO) as an alternative to traditional alignment methods for large language models (LLMs), as it allows for efficient and interpretable alignment without requiring modification of the underlying LLMs. (J. Cheng et al. 2023)

  • Carefully consider the sensitivity and robustness of large language models to prompt templates, particularly in less studied languages like Japanese, as even slight modifications in sentence structure can lead to significant changes in model performance. (Gan and Mori 2023)

  • Carefully consider the potential impact of advanced language models on influence operations, taking into account the various ways in which these models could alter the actors involved, the behaviors employed, and the content produced. (Goldstein et al. 2023)

  • Prioritize improving the base capabilities of open-source language models through scaling, better pre-training data, and enhanced pre-training techniques, instead of solely focusing on imitating proprietary models through fine-tuning on imitation data. (Gudibande et al. 2023)

  • Utilize large language models (LLMs) in conjunction with agent-based modeling (ABM) to create more realistic simulations of human behavior, particularly in complex social systems. (Junprung 2023)

  • Recognize the inherent tradeoff between calibration and hallucination in language models, and explore methods to balance these competing demands. (Kalai and Vempala 2023)

  • Leverage the OpenAssistant Conversations dataset, a large-scale, human-generated, human-annotated assistant-style conversation corpus, to enhance the alignment of large language models with human preferences, thereby improving their usability and accessibility across various domains. (Köpf et al. 2023)

  • Employ careful prompt engineering to maximize the accuracy and value of responses obtained from large language models (LLMs) like ChatGPT in geotechnical engineering, while being mindful of potential hallucinations and misalignments inherent in these models. (Kumar 2023)

  • Focus on developing comprehensive benchmarks for tool-augmented LLMs that encompass a wide range of domains and APIs, simulate real-world multi-turn dialogues, and cover essential capabilities such as planning, retrieving, and calling APIs. (M. Li et al. 2023)

  • Consider employing random sampling techniques when optimizing prompts for language models, as they can achieve state-of-the-art performance and potentially reduce reliance on human expertise. (Y. Lu et al. 2023)

  • Consider employing small language models (SLMs) with prompt-learning paradigms for efficient domain-specific text classification, especially in situations with limited labeled data, as they can achieve comparable accuracy levels to larger models with fewer parameters. (H. Luo, Liu, and Esping 2023)

  • Adopt a combination of graph-of-thought prompting and optimization techniques to generate better outputs in natural language processing tasks. (Muktadir 2023)

  • Consider implementing a LLM-Augmenter system to enhance the performance of large language models like ChatGPT by integrating external knowledge and automated feedback mechanisms, thereby reducing hallucinations while maintaining fluency and informativeness. (B. Peng et al. 2023)

  • Consider employing a structured framework for LLM-based AI Agents, which includes task instruction, designed prompt, tool set, LLM, intermediate output, and final answer, to evaluate the Task Planning and Tool Usage (TPTU) abilities of existing open-source LLMs. (Ruan et al. 2023)

  • Consider developing a hybrid human-and-large language model (LLM) evaluation methodology to assess the factuality and conversationality of LLM-based chatbots, focusing on understudied areas such as recent and tail topics. (Semnani et al. 2023)

  • Explore the use of large language models like GPT-3.5 for intelligent text entry tasks, as they can be easily adapted through prompting rather than expensive data collection and fine-tuning, leading to increased efficiency and performance. (J. Shen et al. 2023)

  • Consider implementing a novel framework called Reflexion, which enhances language agents' learning efficiency by verbally reflecting on task feedback signals and storing these reflections in an episodic memory buffer, leading to improved decision-making in subsequent trials. (Shinn et al. 2023)

  • Carefully evaluate the performance of ChatGPT in various sentiment analysis tasks and settings, comparing it against fine-tuned BERT and state-of-the-art models, to better understand its strengths and limitations as a universal sentiment analyzer. (Z. Wang et al. 2023)

  • Consider combining domain-specific and general data sources when training large language models, as this approach can lead to superior performance on domain-specific tasks while maintaining strong performance on general-purpose benchmarks. (S. Wu et al. 2023)

  • Consider using automated methods, specifically Reprompting, to identify optimal Chain-of-Thought (CoT) prompts for large language models (LLMs) in tasks requiring multi-step reasoning, as it outperforms traditional approaches such as zero-shot, few-shot, and human-written CoT prompting. (W. Xu, Banburski-Fahey, and Jojic 2023)

  • Employ a combination of evidence and question decomposition strategies to enhance the effectiveness of large language models in table-based reasoning tasks. (Y. Ye et al. 2023)

  • Adopt Language Model Programming (LMP) through the use of the Language Model Query Language (LMQL) to enhance the precision, efficiency, and effectiveness of your language model interactions, ultimately leading to improved downstream application performance. (Beurer-Kellner, Fischer, and Vechev 2023)

  • Consider using multiple prompts and automatic benchmarking to effectively evaluate the performance of large language model-generated code solutions, as demonstrated by the authors' finding that selecting the best of 100 solutions generated by ChatGPT is competitive with or better than the top-voted human solution on Stack Overflow for the range of problems tested (see the sketch at the end of this section). (Asare, Nagappan, and Asokan 2022)

  • Utilise the ‘Pythia’ suite of large language models (LLMs) to investigate the development and evolution of LLMs during training and scaling, given its unique features of covering various model scales, consistent training data, and public availability of data and intermediate checkpoints. (Biderman, Bicheno, and Gao 2022)

  • Focus on understanding the underlying mechanisms behind the observed outcomes rather than just relying on statistical associations. (Shaobo Li et al. 2022)

  • Incorporate external tool interaction within your large language models to enable effective self-correction and enhance overall performance. (X. Lu et al. 2022)

  • Consider implementing a retrieval-augmented prompt learning framework like RetroPrompt to effectively decouple knowledge from memorization, thereby achieving improved generalization and memorization capabilities in various natural language processing tasks. (Hsu et al. 2021)

  • Carefully consider the possibility of imitative falsehoods when developing language models, as these falsehoods can arise from the model's training objective and may not be addressed simply through scaling up the model. (S. Lin, Hilton, and Evans 2021)

  • Leverage the WikiGraphs dataset, which consists of Wikipedia articles paired with knowledge graphs extracted from Freebase, to advance the development of graph-to-text generation models, graph representation learning models, and text-conditioned graph generative models. (Luyu Wang et al. 2021)

  • Consider using a combination of autoencoding and autoregressive pre-training methods for your language models, as this approach effectively addresses the limitations of traditional autoencoding and autoregressive methods in handling context-dependent language generation tasks. (Bi et al. 2020)

  • Consider using deep learning techniques, specifically transformer models, to learn from real-world examples when developing automated unit test case generation tools, as demonstrated by the success of AthenaTest in producing accurate, human-readable, and effective test cases. (Tufano et al. 2020)

  • Utilize diagnostic classifiers and confusion scores to analyze the hierarchical and linear information encoded in BERT's self-attention layers, thereby gaining insight into the linguistic structures modeled by the transformer-based model. (Y. Lin, Tan, and Frank 2019)

  • Consider developing a conversational reasoning model that strategically traverses through a large-scale common fact knowledge graph (KG) to introduce engaging and contextually diverse entities and attributes, and collect a new open-ended dialog-KG parallel corpus like OpenDialKG to facilitate this study. (Reddy, Chen, and Manning 2018)

  • Consider incorporating language structures into your pre-training process for deep language understanding tasks, specifically by using two auxiliary tasks to leverage the sequential order of words and sentences, resulting in improved performance across multiple natural language processing tasks. (Bowman et al. 2015)

  • Consider using elastic weight consolidation (EWC) for efficient multi-domain language model pre-training, as it provides the best overall scores with minimal performance drops across multiple tasks. (Goodfellow et al. 2013)

  • Develop a prompt-based framework for resolving the acronym disambiguation problem, incorporating a dynamic negative sampling strategy and a novel hinge loss to create a more robust system. (NA?)

  • Employ well-trained large language models like GPT-4 for biomedical question answering tasks due to their superior semantic understanding, retrieval, and generation abilities compared to traditional methods. (NA?)

  • Focus on developing syntax-aware pretraining and prompt engineering methods to optimize the retrieval of relational knowledge from large language models, taking into account the impact of syntax on the reliability and robustness of the results. (NA?)

  • Employ a combination of clear and specific instructions, explicit constraints, experimentation with context and examples, and leveraging different types of questions to effectively engineer prompts for ChatGPT, thereby improving the quality and relevance of its responses. (NA?)

  • Consider employing the long-answer prompt learning method (KLAPrompt) to effectively integrate semantic knowledge into pre-trained language models, thereby improving their performance across various natural language processing tasks. (NA?)
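
A hedged sketch of the best-of-many-candidates evaluation pattern recommended earlier in this section (multiple prompts, then automatic benchmarking): sample several candidate solutions, run each against automatic test cases, and keep the highest-scoring one. `ask_llm` is a hypothetical stand-in for whatever completion API you use, and the task and tests are toy placeholders.

```python
# Best-of-n selection: sample many candidate solutions, benchmark each against
# automatic test cases, and keep the highest-scoring one.
from typing import List, Tuple

def ask_llm(prompt: str, n: int) -> List[str]:
    """Hypothetical LLM call returning n candidate solutions (stubbed here)."""
    return ["def solve(x):\n    return sorted(x)",
            "def solve(x):\n    return x[::-1]"] * (n // 2)

TESTS: List[Tuple[list, list]] = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]

def score(candidate_src: str) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # never do this with untrusted code
        solve = namespace["solve"]
        return sum(solve(list(x)) == y for x, y in TESTS) / len(TESTS)
    except Exception:
        return 0.0

candidates = ask_llm("Write solve(x) that sorts a list.", n=10)
best = max(candidates, key=score)
print(score(best), best.splitlines()[1].strip())   # 1.0 return sorted(x)
```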

Attention Mechanism

  • Consider using the “Zero-shot chain-of-thought” prompting technique to generate multiple algebraic expressions or Python functions that solve the same math problem in different ways, thus raising the confidence level in the output results (see the sketch at the end of this subsection). (Imani, Du, and Shrivastava 2023)

  • Carefully evaluate the consistency behavior of large language models (LLMs) like ChatGPT and GPT-4 across various dimensions, such as semantic, negation, symmetric, and transitive consistency, to ensure their reliability and trustworthiness in practical applications. (Jang and Lukasiewicz 2023)

  • Be aware of the potential issue of “task contamination” in zero-shot and few-shot evaluations of large language models, which can lead to inflated performance metrics due to the presence of task training examples in the pre-training data. (C. Li and Flanigan 2023)

  • Develop a novel learning framework called Chain of Hindsight (CoH) to effectively harness all available feedback data to enhance model performance without relying on reinforcement learning from human feedback (RLHF), while maintaining the same training objective as pretraining, making it simple to train and easily scalable. (Hao Liu, Sferrazza, and Abbeel 2023)

  • Explore the development of Augmented Language Models (ALMs) that integrate reasoning skills and the ability to utilize tools, thereby enhancing the performance and capabilities of existing language models. (Mialon et al. 2023)

  • Consider the potential impact of non-identifiability of self-attention weights on the interpretation of attention mechanisms in transformer models, and explore the use of effective attention as a complementary diagnostic tool. (Brunner et al. 2019)

  • Consider integrating knowledge graphs (KGs) into language representation (LR) models to enhance your performance in domain-specific tasks, while addressing heterogeneity embedding space (HES) and knowledge noise (KN) issues through techniques such as soft-position and visible matrices. (W. Liu et al. 2019)

  • Consider using procedurally generated psychological experiments rather than vignette-based tasks to evaluate the capabilities of large language models like GPT-3, as these methods help to avoid potential biases arising from the model's exposure to similar tasks during training. (NA?)

  • Carefully consider the types of questions you pose to AI systems, distinguishing between those that are irreversible (where the source of the answer cannot be determined) and those that are reversible (which reveal the source of the response), as this distinction affects the validity and reliability of the conclusions drawn from the responses. (NA?)
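
A hedged sketch of the agreement-based confidence idea recommended earlier in this subsection (Imani, Du, and Shrivastava 2023): ask for the same quantity via several differently phrased zero-shot chain-of-thought prompts, then treat agreement among the parsed answers as a confidence score. `ask_llm`, the prompt variants, and the answer parsing are hypothetical placeholders.

```python
# Confidence via agreement: query several prompt variants for the same problem,
# extract a numeric answer from each, and report the majority answer plus the
# fraction of variants that agree with it.
import re
from collections import Counter
from typing import List

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; stubbed with a canned chain-of-thought reply."""
    return "Let's think step by step. 12 * 4 = 48. The answer is 48."

PROMPT_VARIANTS: List[str] = [
    "Write an algebraic expression for: {q} Let's think step by step.",
    "Write a Python function for: {q} Let's think step by step.",
    "Solve directly: {q} Let's think step by step.",
]

def answer_with_confidence(question: str):
    answers = []
    for template in PROMPT_VARIANTS:
        reply = ask_llm(template.format(q=question))
        numbers = re.findall(r"-?\d+(?:\.\d+)?", reply)
        if numbers:
            answers.append(numbers[-1])          # take the last number as the answer
    top, votes = Counter(answers).most_common(1)[0]
    return top, votes / len(PROMPT_VARIANTS)

print(answer_with_confidence("A box holds 12 eggs. How many eggs in 4 boxes?"))
```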

Pre-Training Techniques

  • Utilize distant supervision to generate pre-training examples that require long-range reasoning, enabling language models to effectively handle multi-hop and hybrid contexts. (X. Deng et al. 2021)

  • Carefully consider the choice of pre-training corpus, pre-training objective, and vocabulary size when developing transformer-based models for abstractive text summarization. (Jingqing Zhang et al. 2019)

Applications Of Large Language Models

  • Carefully consider the choice of large language models (LLMs) and prompt templates when attempting to achieve optimal grammatical error correction (GEC) performance, taking into account factors such as model architecture, size, and domain-specific adaptability. (Davis et al. 2024)

  • Carefully consider the selection of appropriate pre-trained language models (PLMs) and large language models (LLMs) for processing scientific text, taking into account factors such as domain, language, and size, in order to optimize their performance across various tasks and datasets. (Ho et al. 2024)

  • Focus on developing context enhancement strategies for large language models (LLMs) in order to achieve significant improvements in performance for health prediction tasks, particularly by incorporating health knowledge context in prompts. (Yubin Kim et al. 2024)

  • Leverage large language models (LLMs) to generate textual inputs for machine learning (ML) models, instead of relying solely on manually extracted material properties, to improve the efficiency and accuracy of material classification workflows. (S. Liu et al. 2024)

  • Use a test-based, multi-stage, code-oriented iterative flow, called AlphaCodium, to improve the performance of large language models (LLMs) on code generation tasks. (Ridnik, Kredo, and Friedman 2024)

  • Carefully examine the biases present in large language models (LLMs) when integrating generated and retrieved contexts, particularly regarding text similarity and semantic completeness, to optimize their performance in open-domain question answering tasks. (Tan et al. 2024)

  • Carefully consider the implications of integrating artificial intelligence (AI) into scientific publishing, including issues related to originality, ownership, diversity, and potential biases, and establish guidelines and safeguards to address these challenges. (Grimaldi and Ehrler 2023)

  • Carefully select and optimize your prompt templates for the target task, considering factors such as model selection, prompt shaping, prompting approach, and training strategy, to maximize the effectiveness of large language models like Codex in generating high-quality OCL constraints from natural language specifications. (Abukhalaf, Hamdaqa, and Khomh 2023)

  • Conduct a large-scale study to evaluate the effectiveness of large language models like GPT-3.x for root causing and mitigating production incidents, utilizing semantic and lexical metrics alongside human evaluation with actual incident owners. (Ahmed et al. 2023)

  • Carefully evaluate the performance of ChatGPT against specialized models for specific downstream tasks, considering factors such as classification accuracy, unweighted average recall, and statistical significance tests, to determine its suitability for addressing various affective computing problems. (Amin, Cambria, and Schuller 2023)

  • Utilize an entity-centric light-weight personalization layer to enable knowledge-augmentation of large language models (LLMs) with contextual entities retrieved from a personal knowledge store, which is derived from existing search logs that capture users' interactions with modern search engines. (Baek et al. 2023)

  • Consider using large language models, particularly ChatGPT, for reference-free text quality evaluation, as these models demonstrate superior performance compared to most existing automatic metrics, especially when generating an explicit score for text quality. (Y. Chen et al. 2023)

  • Focus on developing a comprehensive AI chain methodology to systematize prompt engineering practices, improving the modularity, composability, debuggability, and reusability of AI functionalities. (Y. Cheng et al. 2023)

  • Carefully engineer prompts to effectively guide large language models towards accurate job type classification, as demonstrated by the superior performance of a zero-shot gpt-3.5-turbo classifier over other models in a real-world setting. (Clavié et al. 2023)

  • Create a diverse and challenging dataset like GHOSTS to thoroughly assess the mathematical capabilities of large language models like ChatGPT and GPT-4, allowing for a more accurate understanding of their strengths and limitations. (Frieder et al. 2023)

  • Explore the potential of ChatGPT for performing human-like summarization evaluation, as it demonstrates promising capabilities in completing annotations smoothly across various evaluation methods and outperforming traditional automatic evaluation metrics on certain datasets. (M. Gao et al. 2023)

  • Integrate Large Language Models (LLMs) with domain-specific expert models to form a comprehensive AI Agent capable of solving complex tasks, and continuously improve the LLM's performance through a Reinforcement Learning from Task Feedback (RLTF) mechanism. (Ge et al. 2023)

  • Focus on creating a diverse and representative dataset, called the Human ChatGPT Comparison Corpus (HC3), to compare and contrast the responses of human experts and ChatGPT across various domains, enabling better understanding of the strengths and weaknesses of both parties, and informing future development of large language models. (B. Guo et al. 2023)

  • Treat large language models (LLMs) as participants in psychology experiments, drawing on diverse subfields of psychology to inform behavioural tests, establishing methodological standards for prompt designs, and carefully interpreting observed behavioural patterns. (Hagendorff 2023)

  • Utilize advanced natural language processing techniques, such as transformer-based large language models, to efficiently produce custom event data without relying on traditional dictionary-based methods, which are prone to errors and limitations. (Halterman et al. 2023)

  • Utilize a combination of existing and newly developed open-source biomedical datasets, adapted into an instruction-following format, to fine-tune large language models for effective medical applications. (Han et al. 2023)

  • Adopt a two-step approach called ‘explain-then-annotate’ to improve the annotation quality of large language models like GPT-3.5, which involves having the model explain the rationale behind the ground truth label or answer for a particular example, followed by constructing a few-shot chain-of-thought prompt with the self-generated explanations to annotate data. (He et al. 2023)

  • Implement an iterative reviewer-author prompt editing system, called Evoke, to optimize the performance of large language models (LLMs) in various tasks. (Xinyu Hu et al. 2023)

  • Consider implementing a hypernetwork prompt guided continual pre-training (HPrompt-CPT) method to strike a balance between forgetting, adaptability, and generalization in continual pre-training scenarios. (G. Jiang et al. 2023)

  • Consider fine-tuning code language models (CLMs) with automated program repair (APR) training data to improve their performance in fixing bugs, as evidenced by the significant improvements observed in the study. (N. Jiang et al. 2023)

  • Thoroughly evaluate the performance of large language models (LLMs) in recommendation systems using various approaches such as zero-shot, few-shot, and fine-tuning, comparing them to traditional recommendation models, and considering factors like model size and data efficiency. (Kang et al. 2023)

  • Utilise a recursive criticism and improvement (RCI) approach when working with large language models (LLMs) to optimise their performance in executing computer tasks. (G. Kim, Baldi, and McAleer 2023)

  • Consider using large language models (LLMs) to generate code explanations for students, as they are perceived as more accurate and easier to understand than those created by students themselves, making them potentially valuable educational resources. (Leinonen et al. 2023)

  • Employ the novel role-playing framework combined with inception prompting to enable autonomous cooperation among communicative agents, thereby reducing human intervention and improving the effectiveness of conversational language models. (G. Li et al. 2023)

  • Leverage ChatGPT's capabilities in natural language understanding and generation to develop efficient and reliable evaluation metrics for assessing the factual consistency of generated summaries, despite some current limitations such as lexical bias, false reasoning, and inadequate alignment. (Z. Luo, Xie, and Ananiadou 2023)

  • Utilize “Prompt Middleware” - a framework that maps options in the User Interface (UI) to generate prompts for Large Language Models (LLMs), thereby enabling direct integration of LLMs into user interfaces and incorporating domain expertise into the prompting process. (MacNeil et al. 2023)

  • Consider utilising dialog-enabled resolving agents (DERA) to enhance the accuracy and completeness of large language model completions in safety-critical applications such as healthcare. (Nair et al. 2023)

  • Carefully evaluate the quality of hints generated by large language models like ChatGPT before using them in educational settings, as they can often contain incorrect answers or solution steps. (Pardos and Bhandari 2023)

  • Consider utilizing an interactive interview format when studying the abductive reasoning capabilities of large language models like GPT-4, as it enables a more comprehensive assessment of their performance in handling complex, real-world scenarios. (Pareschi 2023)

  • Carefully consider the choice of language model, task type, and prompt structure when evaluating the zero-shot learning capabilities of large language models like ChatGPT, as their performance varies across different tasks and prompt conditions. (C. Qin et al. 2023)

  • Leverage high-quality public opinion polls and their associated human responses to create a quantitative framework for investigating the opinions reflected by language models (LMs) and their alignment with various demographic groups. (Santurkar et al. 2023)

  • Consider utilizing a flexible encoder-decoder architecture for large language models (LLMs) in code understanding and generation tasks, along with a diverse mix of pretraining objectives on unimodal and bimodal data, to effectively handle a wide range of downstream tasks. (Y. Wang et al. 2023)

  • Carefully design prompts for ChatGPT to ensure accurate evaluation of natural language generation (NLG) models across different tasks and aspects. (Z. Wang et al. 2023)

  • Carefully modify the Force Concept Inventory (FCI) to suit the text-based input requirements of ChatGPT, ensuring that the questions remain challenging and relevant to the subject matter, while avoiding potential biases introduced by the AI's exposure to certain types of content. (West 2023)

  • Consider using large language models (LLMs) as the basis for developing more advanced and capable AI agents, due to their demonstrated versatility and ability to perform well across various domains. (Xi et al. 2023)

  • Consider enhancing large language models (LLMs) with knowledge graphs (KGs) to improve their ability to recall and apply factual knowledge, ultimately resulting in more informed and accurate responses to user queries. (L. Yang et al. 2023)

  • Carefully evaluate the suitability of large language models (LLMs) versus fine-tuned models for your specific NLP tasks, considering factors such as data availability, task complexity, and desired performance levels. (Jingfeng Yang et al. 2023)

  • Adopt a two-stage optimization process for clinical note generation, combining Automatic Prompt Optimization (APO)-GPT4 for consistency and expert input for personalization. (Z. Yao et al. 2023)

  • Utilise the ‘Tree of Thoughts’ (ToT) framework for language model inference, which enables exploration over coherent units of text (‘thoughts’) that serve as intermediate steps toward problem solving, allowing for deliberate decision making, self-evaluation, and strategic lookahead (see the sketch at the end of this section). (S. Yao et al. 2023)

  • Focus on developing meta-prompt components that provide clear instructions and context, such as a two-step task description and a step-by-step reasoning template, to enhance the performance of large language models in automatic prompt engineering. (Q. Ye et al. 2023)

  • Explore the potential of leveraging the outputs of Large Language Models (LLMs) to refine reasoning paths iteratively, as this can lead to improved performance in reasoning tasks. (Zheng et al. 2023)

  • Utilize a combination of supervised and unsupervised methods when incorporating large language models (LLMs) into your computational social science (CSS) workflows, allowing for improved accuracy and efficiency in analyzing textual data. (Ziems et al. 2023)

  • Utilize multiple strategies to detect and prevent academic dishonesty, including educating students on plagiarism, setting clear guidelines for resource usage, monitoring student work closely, and leveraging advanced technology and techniques to recognize the characteristics of AI-generated content. (Cotton, Cotton, and Shipway 2023)

  • Carefully evaluate the strengths and limitations of ChatGPT, particularly in terms of its ability to accurately process and interpret complex medical and scientific information, before incorporating it into your workflows. (Cascella et al. 2023)

  • Carefully consider the limitations and potential misuses of ChatGPT when incorporating it into your workflows, particularly regarding critical thinking, data reliability, and ethical implications. (Arif, Munaf, and Ul-Haque 2023)

  • Consider incorporating ChatGPT, a large language model developed by OpenAI, into your workflows for computer programming tasks due to its extensive capabilities in areas such as code completion, correction, prediction, error fixing, optimization, document generation, chatbot development, text-to-code generation, and technical query answering. (Biswas 2023)

  • Focus on developing a diverse range of medical question-answering datasets, incorporating various medical domains and formats, while ensuring that the evaluation process includes multiple aspects such as factuality, consistency, safety, harm, and bias. (Singhal et al. 2023)

  • Utilize a systematic literature review (SLR) methodology when studying stance detection, as it allows for a comprehensive understanding of the field and enables the identification of potential areas for improvement. (Alturayeif, Luqman, and Ahmed 2023)

  • Carefully select and categorize questions, gather data from reliable sources, ensure inter-rater reliability, and conduct appropriate statistical analysis to accurately assess the performance of AI models like ChatGPT in addressing complex medical queries. (Samaan et al. 2023)

  • Carefully evaluate the limitations and capabilities of AI tools like ChatGPT in handling complex tasks and decision-making processes, especially in areas requiring deep understanding and critical thinking. (Kortemeyer 2023)

  • Consider building a Large Recommendation Language Model (LRLM) to bridge the gap between Large Language Models (LLMs) and the recommendation task, and improve the recommendation capabilities of LLMs through instruction tuning. (Bao et al. 2023)

  • Consider the potential benefits and drawbacks of incorporating AI code-generators in educational settings, and carefully assess their impact on learning outcomes, code comprehension, and dependency formation among novice programmers. (Kazemitabaar et al. 2023)

  • Utilize artificial intelligence (AI) tools, specifically Large Language Models (LLMs), to efficiently generate varied examples, explanations, low-stakes tests, and assessments for enhancing student learning and retention, while ensuring proper evaluation and adaptation of AI-generated content to fit the specific needs and context of your courses. (Mollick and Mollick 2023)

  • Consider utilising language models to integrate implicit knowledge of drivers in the route optimization process, thereby creating a novel algorithm that emulates real-world driving behaviors. (Y. Liu, Wu, et al. 2023)

  • Consider utilizing large language models (LLMs) for generating code as policies (CaP) in order to achieve adaptable, generalizable, and efficient solutions for various robotics tasks, leveraging the power of hierarchical code generation and third-party libraries. (J. Liang et al. 2022)

  • Carefully engineer prompts to optimize the performance of large language models like ChatGPT in generating legal texts, considering factors such as tone, structure, and specificity of instructions. (Liévin et al. 2022)

  • Utilise advanced AI techniques, such as deep learning and prompt engineering, to generate health awareness messages that are comparable in quality and clarity to human-generated messages, thus improving the efficiency and efficacy of health communication efforts. (Lim and Schmälzle 2022)

  • Consider deploying and evaluating large language model (LLM)-generated code explanations in classroom settings to assess their effectiveness in supporting students' learning and understanding of code. (MacNeil et al. 2022)

  • Consider developing a pre-trained language model specifically tailored to social science texts, like SsciBERT, to enhance the efficiency and accuracy of natural language processing tasks in the field. (S. Shen et al. 2022)

  • Consider using Legal Prompt Engineering (LPE) with Large Language Models (LLMs) for Legal Judgment Prediction (LJP) tasks, as it demonstrates promising results in a zero-shot setting, despite falling short of current state-of-the-art supervised approaches. (Trautmann, Petrova, and Schilder 2022)

  • Consider incorporating interactive natural language processing (iNLP) into your work, which involves integrating language models with external objects like humans, knowledge bases, tools, models, and environments to overcome limitations and advance the field of NLP. (Agrawal and Carpuat 2022)

  • Utilise the concept of ‘algorithmic fidelity’, defined as the degree to which the complex patterns of relationships between ideas, attitudes, and socio-cultural contexts within a model accurately mirror those within a range of human sub-populations, to ensure the validity and applicability of your findings derived from language models. (Sorensen et al. 2022)

  • Consider utilizing large language models (LLMs) for coding open-text survey responses due to their near-human accuracy, potential for significant time and cost savings, and ease of implementation compared to traditional supervised learning methods. (Mellon et al. 2022)

  • Carefully design prompting templates and experiment with bootstrapping strategies to mitigate the challenges faced by large language models in accurately perceiving the order of historical interactions and avoiding popularity or position biases in the context of recommender systems. (“Proceedings of the Web Conference 2021” 2021)

  • Consider the tradeoff between latency, robustness, and effectiveness when implementing deep NLP models in search systems, and explore ways to optimize their performance through various techniques like unnormalized language models, two-pass ranking strategies, and document pre-computation. (Weiwei Guo et al. 2021)

  • Focus on developing more effective prompts for large language models, moving beyond the few-shot paradigm and utilizing techniques such as 0-shot prompts, metaprompts, and natural language semiotics to better locate and communicate tasks to the models. (Reynolds and McDonell 2021)

  • Utilise a large-scale pre-trained model named MusicBERT for music understanding tasks, which uses a novel music encoding method called OctupleMIDI and a bar-level masking strategy to effectively process symbolic music data. (Zeng et al. 2021)

  • Consider the implications of integrating large language models (LLMs) into intelligent personal assistants (IPAs) for improving scalability, capability, and usefulness, while addressing challenges related to fundamental capabilities, efficiency, and security & privacy. (Y. Li and Riva 2021)

  • Utilize knowledge-augmented methods when working with natural language processing, as they enhance the capabilities of models by providing them with external information like common sense, logic, and other relevant details. (Jian Yang et al. 2021)

  • Carefully consider the role of word highlighting in facilitating user evaluations of non-factoid answers, as it can improve efficiency without compromising accuracy. (Bolotova et al. 2020)

  • Consider utilising ChatGPT-3 as a tool to enhance efficiency and effectiveness across multiple domains, from academic writing to detecting security vulnerabilities, while remaining aware of its current limitations such as cost, accessibility, and incomplete comprehension of nuanced language. (B. Li et al. 2019)

  • Carefully choose how to represent raw text data as a numerical array, considering factors such as document division, feature selection, and encoding dependence among language elements, before applying appropriate statistical methods to map the numerical array to predicted values of unknown outcomes. (Gentzkow, Kelly, and Taddy 2019)

  • Consider combining weakly supervised components such as aspect extractors and sentiment predictors when developing neural frameworks for opinion summarization from online product reviews. (Angelidis and Lapata 2018)

  • Utilize Adversarial Filtering (AF) to mitigate annotation artifacts and human biases in your datasets, thereby improving the reliability and validity of your studies. (Zellers et al. 2018)

  • Employ propensity score stratification to reduce bias from confounding factors when studying the impact of early alcohol usage on college success using longitudinal social media analysis. (Kiciman, Counts, and Gasser 2018)

  • Consider using text classification methods, specifically bag of words (BOW) and linear support vector machines (SVM) classifiers, to accurately predict court rulings, law areas, and dates of rulings in legal documents, while taking into account the potential impact of time periods on the textual form of case descriptions. (Şulea et al. 2017)

  • Focus on evaluating the performance of conversational models in real-world settings rather than solely relying on synthetic datasets, and consider incorporating customer profile features to enhance model performance. (Bordes, Boureau, and Weston 2016)

  • Consider using a distantly supervised model to identify dialectal language in social media, specifically African-American English (AAE), by leveraging demographics associated with geo-located messages. (Blodgett, Green, and O’Connor 2016)

  • Focus on analyzing counselor behaviors rather than individual conversations, as this approach provides a clearer picture of general conversation strategies and helps improve counselor training. (Althoff, Clark, and Leskovec 2016)

  • Utilize a combination of advanced deep learning techniques, such as LSTMs and sequence-to-sequence learning, along with innovative methods for semantic clustering and response set generation, to create effective systems for automated email response suggestion. (W. Chan et al. 2015)

  • Focus on addressing specific problems, architectures, and cognitive aspects of language, rather than solely pursuing improvements in state-of-the-art metrics on benchmark tasks. (Manning 2015)

  • Consider utilizing the stringdist package for efficient and accurate computation of various string distances and approximate text matching tasks across diverse platforms. (Mark 2014)

  • Focus on developing and validating appropriate automated text analysis methods tailored to specific research questions and datasets, rather than seeking a universally applicable solution. (Grimmer and Stewart 2013)

  • Adopt a model-based approach to avoid inefficiency and utilize shrinkage and regularization techniques to prevent overfitting when attempting to identify and analyze political content in texts. (Monroe, Colaresi, and Quinn 2008)

  • Consider using Chain Augmented Naive Bayes (CAN) models for text classification tasks, as they offer improved performance compared to traditional naive Bayes models while maintaining simplicity and allowing for the use of advanced smoothing techniques from statistical language modeling. (F. Peng, Schuurmans, and Wang 2004)

  • Carefully consider the impact of data preprocessing steps like removing duplicates or irrelevant folders, as well as the potential limitations of using thread information due to possible redundancy issues, when working with datasets like the Enron corpus for email classification tasks. (“Machine Learning: ECML 2004” 2004)

  • Employ a desk research approach, utilizing secondary sources of information, while maintaining flexibility in identifying relevant reference sources, and focusing on specific keywords to analyze the role of ChatGPT in enhancing student productivity in higher education. (NA?)

  • Explore the potential benefits and risks of integrating AI-generated text into academic writing and research processes, while ensuring proper monitoring, transparency, and adherence to ethical guidelines. (NA?)

  • Utilize full-length papers in addition to abstracts for information extraction tasks, as significant amounts of valuable information are often hidden in the body of the paper. (NA?)

  • Carefully consider the unique linguistic features of text-based asynchronous computer-mediated communication (TA-CMC) compared to other modes of communication, validate existing cues for deception detection in TA-CMC, and focus on objective, context-insensitive linguistics-based cues (LBC) for automating deception detection. (NA?)

  • Employ a comprehensive annotation scheme to accurately capture the nuances of opinions, emotions, sentiments, speculations, evaluations, and other private states in language, allowing for better understanding and analysis of these complex linguistic phenomena. (NA?)

  • Utilise a combination of term-counting and machine learning techniques to achieve higher levels of accuracy in sentiment classification tasks. (NA?)

  • Carefully map existing prompt engineering guidelines onto specific requirements engineering activities, considering both the advantages and limitations of doing so, to effectively leverage large language models in the field. (NA?)

  • Leverage ontological resources, integrate diverse text processing applications, and use an expanded pattern language that mixes syntactic and semantic elements and variable ordering when developing information extraction systems. (NA?)

  • Carefully consider the structural and content differences between abstracts and full-text articles when conducting biomedical text mining, as these differences can impact the performance of text mining tools and the extraction of certain data types. (NA?)

  • Adopt a modular, pipelined system design for natural language processing (NLP) tasks, allowing for mixing-and-matching of various algorithms and improving overall system robustness. (NA?)

  • Carefully engineer prompts for ChatGPT to optimize its performance in detecting plagiarism in simple programming exercises, as the choice of prompt significantly affects the accuracy of the model. (NA?)

  • Focus on developing models for recognizing humor and irony in social media using textual features, specifically those related to ambiguity, polarity, unexpectedness, and emotional scenarios. (NA?)

  • Utilize a combination of world knowledge, event extraction methods, and rule extraction and generalization techniques to effectively predict future events based on existing data. (NA?)

  • Consider incorporating multiple modalities (such as linguistic, audio, and visual features) in your sentiment analysis studies, as doing so can significantly improve the accuracy of your predictions. (NA?)

  • Focus on the potential benefits of ChatGPT for student learning and academic integrity, rather than solely focusing on the risks and negative consequences of its usage. (NA?)

  • Carefully consider the use of multiple prompt engineering techniques in creative tasks, as combining too many techniques may not necessarily enhance idea quality, and a targeted approach selecting specific techniques based on the desired outcome may be more effective. (NA?)

  • Utilize a combination of search techniques (such as bigram hashing and TF-IDF matching) and machine comprehension models (multi-layer recurrent neural networks) to effectively answer open-domain questions using Wikipedia as the primary knowledge source (see the retrieval sketch at the end of this section). (NA?)

  • Consider employing mixed-initiative interface designs when integrating large language models (LLMs) into functional user interfaces, as demonstrated through the successful implementation of OpenAI's Codex in an email client interface, resulting in a decrease in perceived workload and a 62.5% reduction in errors. (NA?)

  • Combine BERT-based deep learning approaches with parallel blocks of single-layer CNNs to improve the performance of fake news detection by capturing semantic and long-distance dependencies in sentences. (NA?)

  • Employ multiple language models, including lexical, IR-based, word2vec-based, and DL-based models, to comprehensively evaluate the correlation between requirements similarity and software similarity in the context of requirements-based code reuse. (NA?)

  • Employ an iterative methodology involving human-in-the-loop interaction between ChatGPT, Google Colab, and biomechanical models to generate accurate and efficient Python code for biomechanical simulations. (NA?)

  • Carefully evaluate the potential benefits and risks associated with integrating ChatGPT and other AI tools into your work, considering factors like academic integrity, privacy, cognitive biases, accessibility, commercialization, and ethical guidelines provided by organizations like UNESCO. (NA?)

  • Consider the limitations of current AI technology like ChatGPT in interpreting and answering complex medical questions, especially in high-stake situations like medical licensing exams, and recognize the need for continued improvement through deep learning. (NA?)

  • Employ AI-generated language models like ChatGPT to simulate doctor-patient consultations, thereby potentially improving patient education and satisfaction, while recognizing the limitations of AI in providing esoteric and personal advice. (NA?)

  • Explore the potential benefits of using prompt-based methods for contextual stance classification, as it offers a promising alternative to traditional supervised learning techniques, especially in situations where labeled training data is scarce. (NA?)

  • Carefully consider the limitations and potential biases of using AI tools like ChatGPT in scientific research, while acknowledging their benefits in terms of knowledge summarization and innovation efficiency. (NA?)

  • Carefully consider the selection of appropriate machine learning algorithms for text classification tasks in chatbot development, comparing their accuracies and choosing the optimal algorithm for your specific application. (NA?)

  • Maintain vigilance, integrate expert-driven fact-checking and verification processes, and encourage the development and implementation of open-source AI technology to address the concerns surrounding the use of large language models (LLMs) like ChatGPT in academic research. (NA?)

  • Employ multiple evaluators when assessing the quality of answers provided by AI language models such as ChatGPT, in order to minimize bias and improve the accuracy and reliability of the evaluation. (NA?)

  • Conduct a narrative review analyzing current research, opinions, and published literature on AI and ChatGPT in the educational sector, focusing on the opportunities and challenges presented by these technologies. (NA?)

  • Carefully consider the limitations and potential harms of using AI tools like ChatGPT for medical text simplification, emphasizing the importance of expert oversight and adaptation to the specific needs of the medical field. (NA?)

  • Carefully craft and optimize your prompts when engaging large language models (LLMs), taking into account factors such as priming, formatting, and uncertainty management, while also considering privacy concerns and the inherent limitations of these models. (NA?)

  • Employ effective prompt engineering methods to ensure accurate and reliable responses from generative language models (GLMs) in medical education applications. (NA?)

  • Utilize OpenAI's ChatGPT language model due to its scale, pre-training, versatility, efficiency, and quality, enabling it to generate scripts effectively in the field of cybersecurity. (NA?)

  • Consider employing a prompt engineering strategy that utilizes two different chemical string representation algorithms - one for the query and the other for the database - in order to improve the effectiveness of chemical similarity searches in identifying structurally distinct functional analogues. (NA?)

  • Focus on developing comprehensive evaluations that go beyond mere rote memorization and instead prioritize critical thinking, problem-solving abilities, and awareness of biases in order to better prepare future medical practitioners. (NA?)

  • Carefully consider your choice of graph-based natural language processing techniques when conducting studies involving text analysis and information retrieval. (NA?)

  • Carefully consider the ethical, legal, and practical implications of using large language models (LLMs) like ChatGPT in the peer review process, including issues around bias, confidentiality, and data privacy. (NA?)

  • Consider utilizing large language models (LLMs) as substitutes for human participants in studies, particularly when investigating specific topics with explicit situational features driving human judgements, and when employing particular tasks such as lengthy surveys that require rapid response times without fatigue. (NA?)

  • Consider integrating retrieval-augmented language models into your clinical workflow to enhance the reliability and accuracy of language model-based clinical decision-making support systems. (NA?)

Sentiment Analysis

  • Consider utilizing pre-trained language models like AfroXLMR when working on sentiment analysis tasks for African languages, as demonstrated by the success of the NLNDE team in achieving the best performance in both monolingual and multilingual classification tasks. (Muhammad et al. 2023)

  • Utilize diverse datasets, including COVID-specific hate terms, general anti-AAPI hate terms, anti-Chinese politics terms, and counter hate terms, to gain a comprehensive understanding of the complexities surrounding anti-Asian hate speech on Twitter before and during the pandemic. (H. Lin et al. 2022)

  • Carefully consider the role of replies in generating emotion and sentiment networks when conducting analyses on Twitter data. (Sailunaz 2018)

  • Focus on utilizing context incongruity as a primary factor in developing effective sarcasm detection models, as demonstrated by its success in improving F-scores by 10-20% compared to previous methods. (Abhijit Mishra et al. 2017)

  • Employ loopy belief propagation (LBP) in conjunction with a graph-based model to effectively propagate sentiments among entities, thereby enhancing sentiment analysis accuracy. (L. Deng and Wiebe 2014)

  • Integrate sentiment analysis techniques with interactive visual analytics to effectively mine and interpret vast amounts of unstructured user-generated data in social media, particularly during disasters and emergencies, to enhance situational awareness and improve crisis management. (Balahur et al. 2013)

  • Carefully consider the level of text granularity being examined, the potential influence of sentiment lexicons, the challenge of sentiment composition, the difficulties inherent in data annotation, the complexities of multilingual sentiment analysis, and the application of sentiment analysis to downstream applications. (Grefenstette et al. 2013)

  • Consider utilizing the Bidirectional Encoder Representations from Transformers (BERT) as a contextual language model in its multilingual version (mBERT) combined with a Convolutional Neural Network (CNN) as a classifier for improved sentiment analysis performance on the Tunisian Arabizi dialectal dataset. (Serra, Araujo, and Santos 2012)

  • Employ a two-step process in phrase-level sentiment analysis, initially categorizing phrases as neutral or polar, followed by disambiguating the polarity of polar phrases, resulting in improved identification of contextual polarity compared to baseline methods; an illustrative two-step sketch appears at the end of this list. (Wilson, Wiebe, and Hoffmann 2009)

  • Utilize a combination of natural language processing techniques, such as n-grams and proximity analysis, along with traditional machine learning approaches to effectively extract and analyze product reviews from the vast amount of data available on the internet. (NA?)

  • Utilise graph-based models to effectively capture pairwise interactions between sentences when conducting sentiment analysis, thereby improving the accuracy of your predictions. (NA?)

  • Utilize bilingual knowledge and ensemble techniques when conducting unsupervised Chinese sentiment analysis, specifically by translating Chinese reviews into English reviews through machine translation services and then performing sentiment analysis on these translated reviews before combining the individual analysis results. (NA?)

  • Consider using a sociotechnical data mining approach, combining human evaluation of emotional content with large-scale text analysis, to accurately capture and quantify emotional states in populations. (NA?)

  • Consider combining rule-based classification, supervised learning, and machine learning into a hybrid method to improve the classification effectiveness of sentiment analysis tasks. (NA?)

  • Carefully consider the impact of document length and linguistic complexity on sentiment analysis performance, particularly in the context of microblogs and other short-form texts. (NA?)

  • Consider leveraging publicly available information on social media platforms, specifically Twitter, to predict users' personality traits using machine learning techniques, thereby enabling improved understanding of individuals and potentially improving their overall experience with interfaces and social media tools. (NA?)

  • Carefully choose and combine sentiment analysis methods to optimize coverage and agreement in online social network data analysis. (NA?)

  • Consider combining multiple lexical resources and utilizing advanced machine learning algorithms, such as fuzzy c-means clustering and support vector machines, to achieve higher accuracy in sentiment analysis, emotion recognition, and personality detection tasks. (NA?)

  • Consider utilizing the free, user-friendly, and comprehensive Sentiment Analysis and Cognition Engine (SEANCE) tool for sentiment, social cognition, and social-order analysis, as it outperforms the popular yet paid Linguistic Inquiry and Word Count (LIWC) tool in various tests. (NA?)

  • Ensure adequate description of your approaches when publishing, so that others can accurately implement and reproduce your results. (NA?)

  • Perform a comprehensive benchmark comparison of sentiment analysis methods across multiple datasets to understand their strengths, weaknesses, and limitations in different contexts. (NA?)

  • Employ multiple machine learning algorithms, such as Naive Bayes, Max Entropy, and Support Vector Machine, alongside lexicon-based approaches, to effectively perform sentiment analysis on Twitter data, while considering the unique linguistic and structural characteristics of tweets. (NA?)

  • Focus on developing accurate and reliable corpora of sarcastic social media posts, incorporating both lexical and pragmatic factors, to improve machine learning algorithms for detecting sarcasm in online communication. (Nigam and Hurst, n.d.)

  • Utilize a double propagation method for simultaneous opinion lexicon expansion and target extraction, leveraging the natural relationships between opinion words and targets, while incorporating sentiment polarity assignment and noisy target pruning techniques for improved accuracy. (NA?)
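
The two-step phrase-level recommendation above (first neutral versus polar, then polarity disambiguation) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the cited system: the scikit-learn TF-IDF features, logistic regression classifiers, and training phrases are all placeholders.

```python
# Minimal sketch of two-step phrase-level sentiment analysis:
# step 1 separates neutral from polar phrases, step 2 disambiguates
# the polarity of phrases judged polar. Features and toy data are
# placeholders, not those of the cited work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

neutrality_data = [("the report was released today", "neutral"),
                   ("an absolutely stunning result", "polar"),
                   ("a terrible waste of money", "polar"),
                   ("the meeting starts at noon", "neutral")]
polarity_data = [("an absolutely stunning result", "positive"),
                 ("a terrible waste of money", "negative")]

step1 = make_pipeline(TfidfVectorizer(), LogisticRegression())
step1.fit([t for t, _ in neutrality_data], [y for _, y in neutrality_data])

step2 = make_pipeline(TfidfVectorizer(), LogisticRegression())
step2.fit([t for t, _ in polarity_data], [y for _, y in polarity_data])

def phrase_sentiment(phrase: str) -> str:
    """Return 'neutral', 'positive', or 'negative' for a phrase."""
    if step1.predict([phrase])[0] == "neutral":
        return "neutral"
    return step2.predict([phrase])[0]

print(phrase_sentiment("a terrible waste of money"))
```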

Text Classification

  • Combine prompt fine-tuning and contrastive learning when developing a medical question classification system using ERNIE 3.0 as a feature extractor, to improve its performance and robustness. (“Proceedings of Third International Conference on Sustainable Expert Systems” 2023)

  • Consider incorporating psycholinguistic knowledge through a tripartite graph network when attempting to detect personality traits from online posts, as this approach allows for more accurate and efficient interactions between nodes within the graph. (Gjurković et al. 2020)

  • Utilize large, authentic, real-world datasets like “liar” to develop effective, broad-coverage fake news detectors, incorporating both textual content and associated metadata. (W. Y. Wang 2017)

  • Consider utilizing various text categorization techniques, such as Bag-of-Words, ELMo, BERT, and ULMFiT, to effectively analyze and interpret complex event data, particularly in the field of conflict studies; a bag-of-words baseline sketch appears at the end of this list. (Beieler 2016)

  • Carefully differentiate between conversational and informational questions in social Q&A sites, as they exhibit distinct characteristics in terms of writing quality, archival value, and structural properties, which can impact the validity and reliability of subsequent analyses. (NA?)

  • Consider combining manual assessment steps with automated topic detection and deep learning classification of Reddit data to accurately categorize mental health-related content and themes. (NA?)

  • Consider using real-world news stories as the basis for writing assignments in order to engage students in analyzing credible sources, describing complex phenomena, and collaborating across disciplines. (NA?)
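
As a point of reference for the bag-of-words baselines mentioned above, the sketch below trains a simple count-vector classifier on a handful of invented event-coding sentences. The labels, sentences, and the choice of a Naive Bayes classifier are illustrative assumptions, not the cited setup.

```python
# Minimal bag-of-words baseline of the kind compared against ELMo/BERT/ULMFiT
# for event-data classification; sentences and labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["troops crossed the border at dawn",
         "the two ministers signed a trade agreement",
         "rebels shelled the northern district",
         "delegations met to discuss a ceasefire"]
labels = ["material_conflict", "material_cooperation",
          "material_conflict", "verbal_cooperation"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["soldiers shelled the town"]))
```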

Information Extraction

  • Consider combining large language models (LLMs) and small language models (SLMs) within a “filter-then-rerank” paradigm, where SLMs act as filters and LLMs as rerankers, to achieve significant improvements in few-shot information extraction tasks. (Y. Ma et al. 2023)

  • Aim to create comprehensive ontologies and datasets for international events, utilizing both human coders and natural language processing techniques, in order to achieve high levels of coverage, recall, and precision when analyzing historical episodes. (Douglass et al. 2022)

  • Utilise a two-stage transformation process involving clausal and phrasal disembedding layers to convert complex linguistic structures into hierarchical representations of core facts and associated contexts, thereby preserving semantic relationships and easing recognition of predicate-argument relations. (Cetto et al. 2018)

  • Consider adopting LexNLP, an open-source Python package specifically tailored for natural language processing and machine learning in legal and regulatory contexts, which offers functionalities such as document segmentation, information extraction, and model training, while being built on established libraries like NLTK and scikit-learn. (Bommarito, Katz, and Detterman 2018)

  • Utilise existing domain-specific lexical resources, such as part-of-speech tags, dictionary collocations, and named-entities, to enhance the accuracy of parsing in specialised fields like biomedical literature. (“Natural Language Processing – IJCNLP 2005” 2005)

  • Carefully consider the impact of various factors such as global corpus size, training set size, and document length on the performance of automatic keyphrase extraction algorithms. (Witten et al. 1999)

  • Employ a hybrid approach combining corpus statistics and linguistic heuristics to extract meaningful sub-compounds from complex noun phrases, thereby improving indexing for information retrieval systems. (Evans and Zhai 1996)

  • Consider employing active learning techniques in natural language processing tasks, specifically semantic parsing and information extraction, to effectively reduce the number of annotated examples required to achieve a high level of performance. (NA?)

  • Consider incorporating linguistic knowledge, such as syntax and part-of-speech (POS) tags, into your data representation when conducting automatic keyword extraction tasks. Doing so can yield significant improvements in precision and overall performance, as demonstrated through various experimental comparisons; a small POS-filtering sketch appears at the end of this list. (NA?)

  • Focus on creating a publicly available dataset for fact-checking tasks, utilizing existing fact-checked statements from reliable sources, and addressing the challenges associated with context, time, speaker, multiple sources, and interpretations in order to advance the field of automated fact-checking. (NA?)

  • Carefully consider the complexity and specificity of the information you aim to extract, taking into account the text type, domain, and desired information, while balancing computational intensity and efficiency in developing your information extraction systems. (NA?)
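
To make the POS-filtering recommendation above concrete, the sketch below keeps only noun and adjective tokens as keyword candidates and ranks them by frequency. It uses NLTK's off-the-shelf tokenizer and tagger (resource names vary by NLTK version) and is a baseline illustration rather than any cited system.

```python
# Sketch of POS-filtered keyword candidate extraction: keep noun/adjective
# tokens and rank by frequency. A real system would add phrase chunking and
# better scoring (e.g. TF-IDF). Requires NLTK tokenizer/tagger resources,
# e.g. nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
# (exact resource names depend on the NLTK version installed).
from collections import Counter
import nltk

def candidate_keywords(text: str, top_k: int = 5):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    # keep nouns (NN*) and adjectives (JJ*) as keyword candidates
    candidates = [w for w, tag in tagged if tag.startswith(("NN", "JJ"))]
    return Counter(candidates).most_common(top_k)

doc = ("Automatic keyword extraction assigns descriptive terms to documents; "
       "linguistic knowledge such as part-of-speech tags improves precision.")
print(candidate_keywords(doc))
```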

Specialization: Text Classification, Summarization, And Generation

Text Classification Algorithms

  • Carefully consider the wording of your prompts when conducting generative classification tasks, as slight changes in phrasing can significantly affect the performance of the model. (Y.-S. Wang and Chang 2022)

  • Develop a machine classifier that can accurately identify online hate speech using data collected from Twitter in the immediate aftermath of trigger events, allowing for timely and effective policy decisions to mitigate potential social disruptions. (Burnap and Williams 2015)

  • Prioritize accurate estimation of document category proportions over individual document classification accuracy, particularly when working with unstructured text data in social sciences; a sketch of the proportion-correction idea follows this list. (Hopkins and King 2009)
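
The proportion-estimation point above can be illustrated with the simple misclassification-corrected count below. This is not the cited paper's estimator, only a sketch of why aggregate proportions should not be read off raw classifier outputs; all numbers are invented.

```python
# Sketch: raw "classify and count" is biased when the classifier errs, but the
# bias can be corrected with error rates measured on a labeled set. Not the
# cited estimator; a minimal illustration of the proportion-estimation problem.
import numpy as np

def adjusted_prevalence(pred_unlabeled, pred_labeled, true_labeled):
    """Estimate the share of positive documents in an unlabeled corpus."""
    pred_unlabeled = np.asarray(pred_unlabeled, dtype=float)
    pred_labeled = np.asarray(pred_labeled, dtype=float)
    true_labeled = np.asarray(true_labeled)
    q = pred_unlabeled.mean()                      # raw classify-and-count
    tpr = pred_labeled[true_labeled == 1].mean()   # sensitivity on labeled set
    fpr = pred_labeled[true_labeled == 0].mean()   # false positive rate
    p = (q - fpr) / (tpr - fpr)                    # misclassification correction
    return float(np.clip(p, 0.0, 1.0))

# toy numbers: 50% predicted positives with tpr = 0.8 and fpr = 0.2,
# so the corrected prevalence is (0.5 - 0.2) / (0.8 - 0.2) = 0.5
print(adjusted_prevalence(pred_unlabeled=[1, 0] * 50,
                          pred_labeled=[1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
                          true_labeled=[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))
```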

Automatic Summarization

  • Carefully consider the challenges of natural language processing when attempting to automatically generate event timelines from history textbooks, including issues related to implicit temporal mentions, entity co-reference resolution, event co-reference resolution, and normalization of entity names. (Adak et al. 2022)

  • Carefully select appropriate summarisation techniques and evaluate their performance against a high-quality Welsh summarisation dataset to effectively support the development of summarizers in minority language contexts. (Ezeani et al. 2022)

  • Utilize advanced statistical tools and machine learning algorithms to optimize the efficiency and accuracy of your study designs, particularly in areas where traditional methods might fall short. (P. Li, Bing, and Lam 2018)

  • Consider utilizing advanced natural language processing techniques, such as machine learning and deep neural networks, to effectively identify and extract essential information from large volumes of online text, thereby creating concise and meaningful summaries; a minimal extractive sketch follows this list. (NA?)
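
The minimal frequency-based extractive summarizer sketched below shows the basic pipeline the bullet above refers to (sentence scoring and selection); the scoring rule is a deliberately simple stand-in for the learned models used in the cited work.

```python
# Minimal extractive summarization sketch: score sentences by the frequency of
# their words and keep the top-scoring ones, preserving original order.
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
                    reverse=True)
    top = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in top)

article = ("Heavy rain flooded the valley. Rescue teams evacuated residents. "
           "The rain is expected to continue. Officials opened three shelters.")
print(summarize(article))
```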

Text Generation

  • Consider using in-context learning to guide large language models to debug code using a “print debugging” method, which involves inserting print statements to trace and analyze logs for fixing the bug, leading to improved performance in code generation tasks; a hedged prompt sketch appears at the end of this list. (Xueyu Hu et al. 2024)

  • Use the chain-of-thought strategy with multi-step optimizations to carefully design prompts for ChatGPT, which can lead to significant improvements in code generation performance. (Haozhe Liu et al. 2023)

  • Utilize a prompt-based editing approach for text style transfer, rather than autoregressive generation, to improve control over the process and avoid error accumulation. (G. Luo et al. 2023)

  • Focus on developing efficient gradient-based optimization algorithms for learning hard text prompts, which offer the benefits of both soft prompts (automatic generation) and hard prompts (portability, flexibility, and simplicity) for controlling generative models. (Wen et al. 2023)

  • Carefully craft your prompts and utilize prompt engineering techniques to maximize the accuracy of Large Language Models (LLMs) in solving chemistry-related problems. (White et al. 2023)

  • Utilise large language models (LLMs) to generate synthetic text for supervised text analysis tasks, thereby addressing issues of transparency, reproducibility, and explainability associated with traditional methods. (Jankowski and Huber 2023)

  • Consider using a human-AI collaborative approach when creating datasets for complex natural language processing tasks such as polite language rewriting, where AI models like GPT-3.5 can significantly reduce human annotation time while maintaining high quality standards. (Xun Wang et al. 2022)

  • Consider using Generative Adversarial Networks (GANs) for improving the quality of text generation, especially when dealing with autoregressive language models or seq2seq models, as GANs explicitly train the generator to produce high quality samples and have shown great success in image generation. (Fedus, Goodfellow, and Dai 2018)

  • Consider utilizing a novel neural generative model that combines variational auto-encoders (VAEs) and holistic attribute discriminators for effective imposition of semantic structures when attempting to generate plausible text sentences with controlled attributes. (Z. Hu et al. 2017)
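
The “print debugging” prompting idea above can be sketched as two prompt templates plus a feedback loop. The helper names (`call_llm`, `run_code`) are hypothetical stand-ins for an LLM API call and a sandboxed execution step, not functions from any cited system.

```python
# Hedged sketch of the "print debugging" prompting loop: ask the model to
# instrument the code with prints, run it, then feed the log back for a fix.
# `call_llm` and `run_code` are hypothetical stand-ins supplied by the caller.
INSTRUMENT_PROMPT = """You are debugging the following Python function, which
fails the given test. Insert print statements that expose intermediate values,
and return only the instrumented code.

Code:
{code}

Failing test:
{test}
"""

FIX_PROMPT = """Here is the instrumented code and the log it produced.
Analyze the log step by step, explain the bug, then return the corrected code.

Instrumented code:
{code}

Log:
{log}
"""

def print_debug_round(code: str, test: str, call_llm, run_code) -> str:
    instrumented = call_llm(INSTRUMENT_PROMPT.format(code=code, test=test))
    log = run_code(instrumented, test)
    return call_llm(FIX_PROMPT.format(code=instrumented, log=log))
```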

Automatic Speech Recognition And Synthesis

  • Focus on developing large-scale weakly supervised speech recognition models that can generalize well across multiple domains, tasks, and languages without requiring extensive fine-tuning or domain-specific adjustments. (Amodei et al. 2015)

Speech Recognition Systems

  • Consider using a generation-based method like SIG for speaker identification in literature, as it allows for easier integration of auxiliary tasks and supports out-of-domain evaluation, leading to improved performance compared to previous baselines and zero-shot ChatGPT. (Z. Su et al. 2023)

  • Use a combination of encoding models, language models, and beam search algorithms to efficiently decode continuous language from non-invasive fMRI brain recordings, enabling accurate reconstruction of heard or imagined stimuli in real-time. (Tang et al. 2022)

  • Develop a miscellaneous-context-based method inspired by conceptual graphs to convert sentences into directed graphs for improved reading comprehension and semantic interpretation. (W.-H. Lin and Lu 2020)

  • Focus on developing a stack-propagation framework for spoken language understanding (SLU) tasks, which enables the integration of intent semantic knowledge to guide slot filling and improve the interpretability of the joint model. (L. Qin et al. 2019)

  • Consider implementing transfer learning methods when working with low-resource RNN-T models, as it can lead to improved performance and stability during training. (Arsikere, Sapru, and Garimella 2019)

  • Utilise multiple machine learning models to identify areas of strength and weakness within a dataset, allowing for a better understanding of the dataset's suitability for benchmarking purposes. (P. Shah et al. 2018)

  • Carefully examine the potential impact of preceding sounds on the perception of subsequent sounds, particularly in the context of speech production and perception. (Mann 1980)

  • Consider using computational strategies, specifically analyzing the statistical distributions of sounds that children hear in ambient language, to understand how infants develop language-specific patterns of listening. (NA?)

  • Consider utilizing advanced natural language processing (NLP) techniques to improve the accuracy and reliability of readability formulas, as these methods have been shown to outperform traditional readability formulas in numerous studies. (NA?)

Speech Synthesis Technologies

  • Focus on developing a two-step approach to text-to-speech conversion, involving separate processes for converting text to high-level semantic tokens and then to low-level acoustic tokens, allowing for greater efficiency and flexibility in handling diverse speech data. (Kharitonov et al. 2023)

  • Consider incorporating both phoneme and grapheme representations of text as input, along with word-level alignment between them, in order to enhance the performance of neural TTS models by producing more natural prosody and accurate pronunciation. (Jia et al. 2021)

  • Consider adopting a multilingual approach to zero-shot multi-speaker TTS tasks, as demonstrated by the success of the YourTTS model in achieving state-of-the-art results in zero-shot multi-speaker TTS and comparable results in zero-shot voice conversion across multiple languages. (Pratap et al. 2020)

Question Answering Systems

  • Consider utilizing In-Context RALM, a simple yet effective technique that enhances language modeling performance by appending relevant documents to the input without requiring any further training of the language model. (Ram et al. 2023)

  • Utilise a combination of coarse labels and heuristic spans to effectively upsample coarse document labels to fine-grained labels or spans, particularly in areas like social sciences where precise data is hard to obtain. (Halterman and Radford 2021)

  • Utilise a multi-task modelling approach when attempting to integrate complex hierarchies of information, such as the 'if-then' relation types presented here, into neural network models. This approach leads to more accurate inference compared to models trained in isolation, as demonstrated by experimental results. (Sap et al. 2018)

  • Utilize a combination of search techniques (such as bigram hashing and TF-IDF matching) along with a multi-layer recurrent neural network model to effectively identify answers within Wikipedia articles for open-domain question answering; a minimal retrieval sketch appears at the end of this list. (D. Chen et al. 2017)

  • Consider using the SearchQA dataset for evaluating your question-answering algorithms because it provides a more realistic representation of the full pipeline of general question-answering, incorporating information retrieval and answer synthesis, and demonstrates a significant gap between human and machine performance. (Dunn et al. 2017)

  • Focus on developing models capable of integrating information across multiple documents to enhance machine comprehension capabilities, as demonstrated by the authors' creation of two datasets (WikiHop and MedHop) that require multi-hop reasoning. (Welbl, Stenetorp, and Riedel 2017)

  • Utilize the MS MARCO dataset for machine reading comprehension and question-answering tasks due to its large scale, real-world nature, and variety of question types and difficulties. (Bajaj et al. 2016)

  • Use sophisticated syntactic and semantic structures, enhanced with Linked Open Data (LOD) knowledge, in order to accurately assess the impact of various factors on answer passage reranking tasks. (Tymoshenko and Moschitti 2015)

  • Leverage graph convolutional networks (GCNs) to capture relationships among entities in documents, and incorporate bi-directional attention between nodes and queries to enhance query-aware node representation for multi-hop reasoning question answering tasks. (Bahdanau, Cho, and Bengio 2014)

  • Adopt a hierarchical classifier guided by a layered semantic hierarchy of answer types to improve the accuracy of question classification in open-domain question answering tasks. (NA?)
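
The retrieve-then-read pipeline referenced above can be reduced to a few lines for its retrieval stage: rank passages by TF-IDF similarity to the question and hand the top hits to a reader model (omitted here). The passages below are toy placeholders.

```python
# Minimal TF-IDF retrieval sketch for open-domain QA (retrieval stage only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(passages)
passage_matrix = vectorizer.transform(passages)

def retrieve(question: str, k: int = 2):
    scores = cosine_similarity(vectorizer.transform([question]), passage_matrix)[0]
    return [passages[i] for i in scores.argsort()[::-1][:k]]

print(retrieve("What is the capital of France?"))
```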

Semantic Analysis And Understanding

  • Carefully consider the impact of prompt style on the performance of large language models like ChatGPT in complex NLP tasks like event extraction, as it can lead to significant variations in results obtained by different users. (Jun Gao et al. 2023)

  • Differentiate between formal linguistic competence (knowledge of linguistic rules and patterns) and functional linguistic competence (understanding and using language in the world) when evaluating large language models (LLMs), as they may excel in one aspect but struggle in another. (Mahowald et al. 2023)

  • Consider balancing diversity and similarity in your demonstration selection strategy for semantic parsing tasks, as this approach can lead to improved performance in tasks like Text-to-SQL. (Nan et al. 2023)

  • Utilize the extensive LCC Metaphor Datasets, which offer a comprehensive resource for metaphor research, featuring metaphoricity ratings, scored links to source and target concept domains, and ratings for affective polarity and intensity, across multiple languages. (“Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics” 2020)

  • Utilize the proposed Abductive Natural Language Inference (αNLI) and Abductive Natural Language Generation (αNLG) tasks to evaluate the effectiveness of language-based abductive reasoning models, particularly focusing on the ability to handle incomplete observations and generate plausible explanations. (Bhagavatula et al. 2019)

  • Focus on developing a novel approach called “SAR” that combines seed-based and unsupervised adversarial learning methods to effectively map APIs across languages with minimal parallel corpora. (Bui, Yu, and Jiang 2019)

  • Utilise a combination of Wikipedia and Common Crawl data to train your word vectors, as this allows for higher quality word representations due to the increased volume and diversity of the data. (Grave et al. 2018)

  • Be aware of potential annotation artifacts in natural language inference datasets, which can lead to overestimation of model performance and misinterpretation of results. (Gururangan et al. 2018)

  • Consider using Bayesian models of annotation for analyzing crowdsourced data in Natural Language Processing, as these models offer improved performance compared to traditional methods such as majority voting and inter-annotator coefficients of agreement. (Paun et al. 2018)

  • Utilize the Multi-Genre Natural Language Inference (MultiNLI) corpus when developing and evaluating machine learning models for sentence understanding, as it offers broader coverage and increased difficulty compared to previous datasets, allowing for better assessments of model performance. (Williams, Nangia, and Bowman 2017)

  • Use the Dutch FrameNet (DFN) annotation tool to generate a rich linguistic dataset that combines both referential and frame annotations, allowing them to identify and understand the various ways in which real-world event instances are framed within and across documents. (Noord and Bos 2017)

  • Focus on creating a unified framework for combining multiple semantic components, allowing for an extrinsic evaluation of these modules and potentially improving various natural language processing applications like machine translation, summarization, generation, and question answering. (Agirre et al. 2014)

  • Posit a separate level indicating the event structures associated with predicates and their arguments, as this enables a deeper understanding of the relationship between syntax, semantics, and event structure in natural languages. (Kreiner and Eviatar 2014)

  • Utilise multiple distributional methods (such as PPMI, SVD, and SGNS) when studying semantic change over time, as each method offers unique strengths and weaknesses depending on the type of semantic change being investigated; a small PPMI sketch appears at the end of this list. (Yoon Kim et al. 2014)

  • Carefully consider the mass-count distinction when creating axioms for a formalized knowledge base using WordNet, as it affects the inferential relationships between concepts and impacts the precision and reliability of the knowledge represented. (Gordon and Schubert 2013)

  • Carefully select and annotate a diverse range of clinical texts to create a comprehensive and valuable semantically annotated corpus for developing and evaluating information extraction systems in healthcare. (Roberts et al. 2009)

  • Carefully evaluate the quality of word frequency norms being utilised, considering factors like corpus size, language register, and the definition of the frequency measure, and ideally adopt a new and improved word frequency measure like the SUBTL frequency norms from the SUBTLEXUS corpus. (Brysbaert and New 2009)

  • Consider adopting a Bayesian inference approach to understand how individuals can effectively generalize meaning from a limited number of examples, without assuming that words are mutually exclusive or mapped solely onto basic-level categories. (F. Xu and Tenenbaum 2007)

  • Utilize the online common-sense knowledge base, HowNet, to explore inter-conceptual and inter-attribute relationships within lexicons of both Chinese and English languages, thereby facilitating a deeper understanding of the nuances of meaning across cultures and languages. (Dong and Dong 2006)

  • Consider utilizing the PMI-IR algorithm over LSA for tasks involving synonym recognition, as it demonstrates superior performance on the TOEFL and ESL tests. (Turney 2002)

  • Carefully consider the choice between the joint task (syntactic dependency parsing and semantic role labeling) and the SRL-only task (using provided syntactic dependency parses) when evaluating natural language processing models, as the former allows for a more comprehensive analysis of model performance. (Sakai et al. 1995)

  • Consider using the supervaluation-based approach to address vagueness in natural language, as it offers a way to manage vagueness without having to abandon core principles of logic. (“Logic and Lexicon” 1995)

  • Consider the impact of context on the interpretation and memory of idiomatic expressions, as it influences the ease of comprehension and recall, suggesting that the distinction between literal and metaphoric language is better understood as a continuum between conventional and unconventional utterances. (NA?)

  • Carefully consider and utilize various lexical properties, including frequency, length, part of speech, and semantic features, when selecting and analyzing words for psycholinguistic studies. (NA?)

  • Consider using event-related potentials (ERPs) to investigate the cognitive processes underlying language comprehension, as ERPs can provide insights into the neural mechanisms responsible for integrating syntactic and semantic information during sentence processing. (NA?)

  • Utilise a combination of Latent Semantic Analysis (LSA) and Construction-Integration (CI) models to create a high-dimensional semantic space for analysing metaphor comprehension. (NA?)

  • Consider combining multiple methods, such as best-first clustering, alternative training set creation, and refined string match features, to achieve statistically significant gains in precision for coreference resolution tasks. (NA?)

  • Prioritize semantic validity when grouping semantic types, ensuring that the groups are semantically coherent, parsimonious, complete, exclusive, natural, and useful for some purpose. (NA?)

  • Carefully consider the choice of evaluation metrics when comparing the performance of semantic textual similarity (STS) models across different datasets, as different metrics may lead to varying interpretations of model effectiveness. (NA?)

  • Utilise similarity-based models to enhance your probability estimates for unseen word combinations in natural language processing tasks, as demonstrated through improvements in language modelling and pseudo-word disambiguation tasks. (NA?)

  • Carefully define structural complexity and address Matsumoto's objection to ensure accurate prediction of conversational inferences. (NA?)

  • Consider combining multiple lexical association measures to improve the accuracy of collocation extraction, as evidenced by the authors' empirical findings demonstrating significant improvements in performance through various combination methods. (NA?)

  • Develop a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments, utilizing a modular architecture and combining OSCAR, domain-specific regex, and English taggers to identify parts-of-speech, and employing ANTLR grammar to structure this into tree-based phrases. (NA?)

  • Consider using TAALES 2.0, a tool that provides a broad array of indices related to word and (n)-gram frequency and range, (n)-gram strength of association, contextual distinctiveness, word recognition norms, semantic network, and word neighbors, to analyze and understand various aspects of language development and proficiency. (NA?)

  • Employ the Structural Topic Model (STM) for analyzing multilingual textual data in comparative politics, as it offers a flexible way to incorporate metadata associated with the text, such as when it was written, where it was written, who wrote it, and characteristics of the author, into the analysis using document-level covariates, thereby allowing researchers to understand relationships between metadata and topics in your text corpus. (NA?)

  • Carefully consider the implications of your choice of word association dataset, taking into account factors such as the number of cues, the number of responses per cue, and the representativeness of the sample population, as these choices can significantly impact the reliability and generalizability of findings. (NA?)

  • Assess bias at the contextual word level rather than just the sentence level, as this approach captures the contextual effects of bias while avoiding confounding effects that underestimate bias at the sentence encoding level. (NA?)

  • Utilise word embeddings as a quantitative lens to analyse historical trends, particularly in relation to gender and ethnic stereotypes, as they can accurately capture societal changes and offer a valuable complementary perspective alongside traditional linguistic and sociological approaches. (NA?)

  • Consider using a knowledge-aware prompt-tuning approach with synergistic optimization for relation extraction tasks, as it effectively leverages semantic and structural knowledge among relation labels and reduces the need for domain expertise in prompt template selection. (NA?)

  • Utilize tree kernels for natural language processing tasks due to their ability to generate numerous syntactic features and allow learning algorithms to choose the most relevant ones for a particular application, despite their computational complexity being superlinear in the number of tree nodes. (NA?)
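
One of the distributional methods named in the semantic-change bullet above (PPMI) is small enough to sketch directly: build a windowed co-occurrence matrix, convert it to positive pointwise mutual information, and compare word rows across time-sliced corpora. The toy corpus and window size below are illustrative assumptions.

```python
# Minimal PPMI sketch: windowed co-occurrence counts -> PPMI matrix.
import numpy as np

corpus = ("farmers broadcast seed over the field while stations "
          "broadcast news over the air").split()
window = 2
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

# symmetric co-occurrence counts within the window
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[index[w], index[corpus[j]]] += 1

total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # clip zeros/negatives

print(dict(zip(vocab, np.round(ppmi[index["broadcast"]], 2))))
```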

Word Embedding Methods

  • Consider using Predictive Text Embedding (PTE) for text classification tasks, as it combines the strengths of unsupervised text embeddings and convolutional neural networks, resulting in improved efficiency, scalability, and reduced sensitivity to model parameters. (NA?)

Semantic Role Labeling

  • Consider optimizing a graph-based parser that treats the alignment and graph segmentation as latent variables, allowing for simultaneous induction of both components during training. (Dohare, Karnick, and Gupta 2017)

  • Adopt a support vector machine (SVM) classifier for semantic parsing tasks, as it performs well on text classification tasks and allows for efficient training and testing processes. (NA?)

  • Prioritize the integration of syntactic parsing information in the early stages of semantic role labeling, particularly during the pruning phase, to achieve optimal performance. (NA?)

Relation Extraction

  • Utilise pre-trained language representations rather than explicit linguistic features when conducting relation extraction tasks. This approach offers several benefits including reduced reliance on annotated language resources, decreased potential for error accumulation due to less explicit feature extraction, and enhanced sample efficiency. (Alt, Hübner, and Hennig 2019)

  • Carefully consider the importance of feature selection and engineering in improving the performance of your machine learning models, as demonstrated by the surprising finding that a simpler classifier trained on similar features performed comparably to a more complex neural network system. (Joulin et al. 2016)

  • Focus on developing methods for detecting and classifying events, anchoring events temporally, and identifying and classifying explanatory relations between events in order to effectively analyze and interpret news stories. (Mostafazadeh et al. 2016)

  • Combine distant and partial supervision for relation extraction by providing partial supervision to a distantly supervised relation extractor using a small number of carefully selected examples, resulting in improved performance. (“Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)” 2014)

  • Adopt the 'expressed-at-least-once' assumption rather than the 'distant supervision' assumption when dealing with relation extraction tasks, particularly when the training knowledge base is an external source of information. (NA?)

Machine Learning In Nlp

  • Utilize Deductive Closure Training (DCT) to enhance the coherence, accuracy, and updatability of language models by employing the models themselves to recognize implications and contradictions within the text they produce, thereby enabling efficient self-supervised refining. (Akyürek et al. 2024)

  • Consider employing advanced machine learning techniques, specifically GPT-3, to operationalize contextual predictability in your studies, as it provides the best account of N400 amplitude and suggests that seemingly diverse N400 effects of expectancy, plausibility, and contextual semantic similarity can be reduced to variations in the predictability of words. (Michaelov et al. 2024)

  • Consider employing an autonomous agent to instruct the reasoning process of large language models in order to enhance their zero-shot reasoning abilities on general language understanding tasks. (Crispino et al. 2023)

  • Consider utilizing emerging chain-of-thought (CoT) reasoning techniques in large language models (LLMs) to enhance both predictive performance and explainability, especially when dealing with complex tasks. (Hebenstreit et al. 2023)

  • Account for the unique challenges posed by large language models (LLMs) when conducting regression testing, such as different correctness notions, prompting brittleness, and non-determinism in LLM APIs. (W. Ma, Yang, and Kästner 2023)

  • Consider treating large language models as latent variable models, enabling them to develop algorithms for selecting optimal demonstrations for in-context learning, leading to improved performance across various natural language processing tasks. (Xinyi Wang et al. 2023)

  • Carefully examine the generalizability of language models to new task variants, specifically focusing on counterfactual tasks that maintain the core reasoning procedure but change the input-output mappings, to determine the extent to which the model's performance is due to transferable, generalizable reasoning skills or condition-specific behaviors. (Z. Wu et al. 2023)

  • Consider using a prompt-based adversarial attack (PromptAttack) to effectively assess the adversarial robustness of large language models (LLMs) by converting adversarial textual attacks into an attack prompt that causes the victim LLM to output the adversarial sample, while preserving the original semantic meanings of the adversarial examples through a fidelity filter and enhancing the attack power by ensembling adversarial examples at different perturbation levels. (X. Xu et al. 2023)

  • Consider incorporating multimodal information sources, specifically combining language and visual data, into your experimental designs to enhance the validity and reliability of your findings. (Z. Zhang et al. 2023)

  • Utilise the GitTables dataset, a large-scale corpus of 1 million relational tables extracted from CSV files in GitHub repositories, to improve the performance of deep learning models in various data management tasks, such as data search and preparation, by providing a more accurate representation of typical database tables. (Hulsebos, Demiralp, and Groth 2023)

  • Focus on developing task-specific adapters and multi-token label embeddings to improve the efficiency and accuracy of few-shot learning without relying on handcrafted prompts and verbalizers. (Mahabadi et al. 2022)

  • Focus on developing effective strategies for creating contrastive data sets and optimizing their corresponding learning objectives in order to improve the performance of natural language processing models across various tasks. (Miller 2021)

  • Consider combining multiple model compression techniques, such as parameter quantization and perfect hashing, to significantly reduce the memory footprint of natural language understanding models while maintaining minimal predictive performance impact. (Strimel, Sathyendra, and Peshterliev 2018)

  • Consider employing a specialization-generalization training strategy based on prompt learning to disentangle general matching signal learning and specific task combination, allowing for enhanced multi-task generalization abilities in text matching models. (NA?)

  • Consider using ontology-enhanced prompt-tuning (OntoPrompt) when working on few-shot learning (FSL) projects involving pre-trained language models (PLMs), as it addresses challenges related to knowledge noise and heterogeneity. (NA?)

Supervised Learning

  • Consider using ensemble methods that combine the output of successful, separately developed modules to create more accurate solutions for natural language problems, as this approach outperforms any individual module alone. (Turney et al. 2003)

Unsupervised Learning

  • Leverage ChatGPT for text data augmentation in order to enhance the performance of few-shot learning text classification tasks, as evidenced by its ability to generate more diverse and accurate augmented samples. (Dai et al. 2023)

  • Consider utilizing open-source large language models (LLMs) combined with powerful rerankers to effectively generate synthetic query-document pairs for training information retrieval systems, leading to significant improvements in performance. (Jeronymo et al. 2023)

  • Consider using a unified multilingual prompt, such as UniPrompt, for zero-shot cross-lingual transfer of prompt-based tuning in order to effectively leverage the capabilities of pretrained language models (PLMs) across multiple languages without requiring separate prompt designs for each language. (L. Huang et al. 2022)

  • Incorporate the “ordered sequence of terms” assumption into your information retrieval models, allowing them to utilize advancements in statistical natural language processing and potentially improve the performance of your models; a query-likelihood sketch follows this list. (Hiemstra 1998)
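
As a concrete instance of the statistical language-modelling view of retrieval referenced above, the sketch below ranks documents by a smoothed unigram query-likelihood score (Jelinek-Mercer interpolation with the collection model). The documents, query, and smoothing weight are illustrative assumptions, and term order is ignored in this simplified version.

```python
# Minimal unigram query-likelihood retrieval sketch with Jelinek-Mercer
# smoothing: score(d) = sum_w log( lam * P(w|d) + (1 - lam) * P(w|collection) ).
import math
from collections import Counter

docs = {
    "d1": "information retrieval with statistical language models".split(),
    "d2": "neural networks for image recognition".split(),
}
collection = Counter(w for words in docs.values() for w in words)
coll_len = sum(collection.values())

def query_likelihood(query: str, doc: list, lam: float = 0.5) -> float:
    tf = Counter(doc)
    score = 0.0
    for w in query.lower().split():
        p = lam * tf[w] / len(doc) + (1 - lam) * collection[w] / coll_len
        score += math.log(p) if p > 0 else float("-inf")
    return score

query = "statistical language models"
print(sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True))
```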

Reinforcement Learning

  • Consider implementing the GRATH algorithm, which uses Direct Preference Optimization (DPO) to iteratively refine truthfulness data and update the model, resulting in a gradual improvement in model truthfulness in a self-supervised manner. (W. Chen, Song, and Li 2024)

  • Focus on improving the quality and diversity of your instruction sets through the use of advanced techniques such as Instruction Fusion, which combines multiple seed instructions into a single, more complex prompt, rather than relying solely on traditional evolutionary approaches. (Weidong Guo et al. 2023)

  • Consider the impact of varying amounts of instruction data on model performance, especially in real-world use cases, as it can lead to continuous improvements in tasks such as open-ended generation, while remaining relatively stable in tasks like math and code. (Ji et al. 2023)

  • Carefully inspect the application of prompt engineering and calibration techniques on smaller language models, as their individual benefits may vary depending on the specific model used, and their combined effect tends to be largely negative. (C. Ma 2023)

  • Adopt a leader-follower bilevel framework to optimize the prompt-generation policy and action-policy simultaneously, thereby improving the efficiency and accuracy of large language models in decision making tasks. (Yan et al. 2023)

  • Utilise a suite of diagnostics derived from human language experiments to gain a deeper understanding of the linguistic capacities of pre-trained language models, such as BERT, and to identify areas of improvement. (Ettinger 2019)

Computational Linguistics

  • Develop a comprehensive benchmark for media bias detection, called MBIB, which covers nine distinct tasks and 22 datasets, allowing for better comparison and evaluation of models in a standardized way. (Wessel et al. 2023)

  • Consider the role of cultural transmission in understanding the evolution of language, as it can significantly alter the relationship between innate learning biases and linguistic behavior, leading to the emergence of strong universals even with weak innate biases. (Kirby, Dowman, and Griffiths 2007)

  • Carefully select appropriate cleaning stages and corpus subsets to ensure accurate representation of the target population and reduce noise in your analysis. (NA?)

Syntax And Parsing

  • Utilize web-scale corpora, specifically the DepCC corpus, for improved performance in natural language processing tasks such as verb similarity, as demonstrated through its superior results on the SimVerb3500 dataset when compared to smaller corpora like Wikipedia. (Panchenko et al. 2017)

  • Utilize large eye-tracking corpora like GECO to explore various aspects of language processing, particularly in bilingual populations, as it provides a rich source of data for understanding the complexity of reading behaviors and the interactions between different language processes. (Calvo and Meseguer 2002)

  • Utilize computational simulation informed by theoretical linguistics to better understand and explain real linguistic data in terms of the underlying processes driving human language. (Kirby 2002)

  • Carefully consider the structural differences between multiple-fronting languages, particularly regarding the placement of Wh-words in SpecCP, as this impacts the interpretation and comparison of results across languages. (NA?)

  • Consider the potential impact of embodied relations on language comprehension, specifically examining the role of spatial iconicity in shaping word order patterns and influencing response times during semantic judgments. (NA?)

  • Consider employing a variety of parsing strategies, including different directions (forward or backward), learners (MaxEnt or SVM), and search strategies (best-first or deterministic), to achieve improved performance in dependency parsing tasks. (NA?)

  • Utilize syntactic n-grams (sn-grams) over traditional n-grams in machine learning tasks, as sn-grams are based on syntactic relationships rather than surface structure, allowing for more accurate and interpretable results. (NA?)

  • Utilize probabilistic context-free grammars (PCFGs) to effectively perform statistical constituency parsing, which involves assigning probabilities to different parse trees and selecting the one with the highest probability to accurately interpret ambiguous sentences; a small NLTK sketch follows this list. (n.d.)
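
The PCFG bullet above can be demonstrated with NLTK's Viterbi parser on a classic attachment ambiguity; the toy grammar and its probabilities are invented for illustration, and the chosen parse is simply the highest-probability tree under that grammar.

```python
# Minimal PCFG parsing sketch: NLTK's Viterbi parser returns the most
# probable tree for an ambiguous sentence under a toy grammar.
import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP              [1.0]
    NP -> 'I'                [0.4]
    NP -> Det N              [0.4]
    NP -> Det N PP           [0.2]
    VP -> V NP               [0.6]
    VP -> V NP PP            [0.4]
    PP -> P NP               [1.0]
    Det -> 'a' [0.5] | 'the' [0.5]
    N  -> 'man' [0.5] | 'telescope' [0.5]
    V  -> 'saw'              [1.0]
    P  -> 'with'             [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("I saw the man with a telescope".split()):
    print(tree, tree.prob())
```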

Morphology And Lexical Analysis

  • Carefully choose the appropriate tokenization method for your specific language and application, considering factors like morphology, vocabulary size, and downstream task performance. (Toraman et al. 2023)

  • Consider adopting a universal tagging schema and data formats to enable efficient integration of data from various sources, while maintaining consistency and accuracy in the representation of linguistic information. (Kirov et al. 2018)

  • Be aware of the impact of text preprocessing choices on unsupervised learning models, as these choices can significantly influence the results and interpretations drawn from the data. (“Replication Data for: Not so Harmless After All: The Fixed-Effects Model” 2017)

  • Adopt a pragmatic approach to Chinese word segmentation, defining words based on their usage in practical applications rather than relying solely on traditional linguistic definitions; a dictionary-matching baseline sketch appears at the end of this list. (Jianfeng Gao et al. 2005)

  • Consider exploring rule-based tagging for part of speech identification tasks, as it offers several advantages over stochastic tagging methods, including improved portability, less storage space requirement, easier modification, and potentially equal or superior performance. (NA?)

  • Consider applying the maximum entropy model, a statistical machine-learning algorithm, to Chinese word segmentation tasks, as it demonstrates high levels of precision and recall rates (95.01% and 94.94% respectively) when trained on a 237K-word dataset. (NA?)
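
For the Chinese word segmentation bullets above, the sketch below implements the classic forward-maximum-matching dictionary baseline. It is not the statistical (maximum-entropy) approach of the cited work; its well-known failure on the example sentence is exactly what motivates such statistical models.

```python
# Forward maximum matching: greedily take the longest dictionary word at each
# position. The example shows its classic error (研究/生命/起源 is correct),
# which motivates statistical segmenters like the maximum-entropy model above.
def forward_max_match(sentence: str, lexicon: set, max_len: int = 4) -> list:
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            # fall back to a single character when no dictionary word matches
            if sentence[i:j] in lexicon or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))  # ['研究生', '命', '起源']
```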

Discourse Analysis

  • Avoid conflating Grice's project of analyzing the structure of communication with Relevance Theory's aim of modeling the cognitive processes underlying interpretation, as each approach addresses distinct aspects of language comprehension. (NA?)

Multilinguality And Crosslingual Transfer

Code Switching And Mixing

  • Consider incorporating linguistic and social perspectives when studying code-switching (C-S) in language technologies, as current massive language models struggle to accurately represent diverse C-S types due to lack of appropriate training data, robust evaluation benchmarks, and end-to-end systems that account for sociolinguistic aspects of C-S. (Doğruöz et al. 2023)

Evaluation And Assessment Techniques

  • Investigate the redundancy of large language model (LLM) outputs, particularly focusing on identifying instances where LLMs generate unnecessary calculations and reasoning, which could potentially hinder their overall performance. (Chiang and Lee 2024)

  • Consider the CRUD-RAG framework when developing and evaluating retrieval-augmented generation (RAG) systems, as it allows for a more comprehensive assessment across various application scenarios, including creating, reading, updating, and deleting. (Lyu et al. 2024)

  • Prioritise creating a geographically and temporally balanced dataset to accurately evaluate the factuality of large language models (LLMs) and identify potential biases, thereby promoting global inclusivity and fairness in computational systems. (Mirza et al. 2024)

  • Consider using Gradient-Based Red Teaming (GBRT) as an efficient and scalable method for generating diverse prompts that effectively identify weaknesses in generative language models, leading to improved model alignment and evaluation. (Wichers, Denison, and Beirami 2024)

  • Conduct rigorous fairness evaluations for each intended clinical use case of large language models (LLMs) like GPT-4 to prevent perpetuating or amplifying health disparities. (Zack et al. 2024)

  • Develop a series of tasks to assess the ability of large language models (LLMs) to parse, understand, analyze, and create knowledge graphs using Turtle syntax, and integrate these tasks into an automated evaluation system like LLM-KG-Bench to gain insights into the strengths and limitations of LLMs in handling formal languages within knowledge graph engineering workflows. (Arndt et al. 2023)

  • Compare different approaches such as pre-training, fine-tuning, and prompt engineering techniques to determine the optimal method for completing novel tasks with limited data, especially in the field of large language models. (Addlesee et al. 2023)

  • Carefully consider the selection of prompts when applying prompt-based learning methods to detect biases in language models, as the choice of prompts can greatly impact the model's ability to accurately identify and mitigate biases. (Aowal et al. 2023)

  • Ensure that your experimental setups are rigorous and unbiased, allowing for fair and accurate evaluation of the model's performance. (Bordt and Luxburg 2023)

  • Carefully evaluate and document the limitations and biases present in large language models like ChatGPT, particularly in terms of reasoning, factual accuracy, math, coding, and bias, in order to better understand their strengths and weaknesses and guide improvements in future iterations. (Borji 2023)

  • Consider using ChatGPT for tasks requiring an understanding of sentence-level relations, especially causal relations, but acknowledge its limitations in handling temporal and implicit discourse relations. (C. Chan et al. 2023)

  • Develop a comprehensive understanding of factuality across diverse domains, rather than solely focusing on world knowledge, to effectively evaluate the accuracy of large language models. (S. Chen et al. 2023)

  • Consider utilizing large language models (LLMs) as an alternative to human evaluation for assessing the quality of texts, as LLMs can effectively mimic human evaluators and offer stable results across various formatting and sampling methods. (Chiang and Lee 2023)

  • Utilize the AI Occupational Exposure (AIOE) methodology, originally proposed by Felten et al. (2018, 2021), to evaluate the influence of advanced language models like ChatGPT on different professions, industries, and regions. (Felten, Raj, and Seamans 2023)

  • Leverage the emerging capabilities of generative pre-trained language models, specifically their zero-shot instruction and in-context learning abilities, to develop a novel evaluation framework called GPTScore. This framework enables customized, multi-faceted, and training-free evaluation of generated texts, addressing long-standing challenges in text evaluation. (Fu et al. 2023)

  • Develop comprehensive evaluation suites, such as C-Eval, to accurately assess the advanced knowledge and reasoning abilities of foundation models in a specific linguistic and cultural context, allowing for targeted improvements and fostering growth for users in that region. (Y. Huang et al. 2023)

  • Carefully consider prompt wording when deploying large language models (LLMs) for downstream tasks, as GPT-3 responses are shown to be inconsistent and unreliable across different prompts and settings. (Khatun and Brown 2023)

  • Consider using large language models like GPT-4 with chain-of-thoughts (CoT) and a form-filling paradigm to achieve better alignment with human judgment when evaluating the quality of natural language generation (NLG) outputs. (Y. Liu, Iter, et al. 2023)

  • Move away from dataset-driven practices that focus on specific dimensions and types of biases, towards a more holistic approach that considers the diversity of cultures and languages across the globe. (Ramesh, Sitaram, and Choudhury 2023)

  • Carefully evaluate the performance of large language models (LLMs) on math word problems (MWPs) by analyzing their responses under varying conditions, such as requiring them to show their work or not, and assessing the influence of factors like the number of unknowns and operations on the likelihood of failure. (Shakarian et al. 2023)

  • Utilise the Tensor Trust dataset to explore the vulnerability of large language models (LLMs) to prompt injection attacks, specifically focussing on the two types of attacks - prompt extraction and prompt hijacking. (Toyer et al. 2023)

  • Consider the importance of adversarial and out-of-distribution robustness when evaluating the performance of AI systems like ChatGPT, particularly in safety-critical scenarios. (J. Wang et al. 2023)

  • Leverage large language models like ChatGPT to efficiently and cost-effectively assess the reliability of news domains, given your strong correlation with human expert judgements. (K.-C. Yang and Menczer 2023)

  • Carefully analyze the relationship between the capabilities of large language models (LLMs) and their vulnerabilities to indirect prompt injection attacks, and subsequently develop appropriate defense mechanisms to mitigate these risks. (Yi et al. 2023)

  • Consider using human-centric benchmarks, such as AGIEval, when evaluating the performance of foundation models in order to obtain a more accurate representation of their capabilities in real-world scenarios. (Zhong et al. 2023)

  • Carefully consider the impact of epistemic markers, such as expressions of certainty, uncertainty, or evidentiality, on language models, as they can significantly influence model accuracy and performance. (K. Zhou, Jurafsky, and Hashimoto 2023)

  • Employ Prompt Risk Control (PRC), a framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures, to reduce the risk of generating unexpectedly poor responses in large language models, particularly for the worst-off users. (Zollo et al. 2023)

  • Consider leveraging AI tools like ChatGPT and DALL-E to enhance various aspects of your work, such as discovery and search, research assistance, reference services, teaching, textbook creation, information literacy and digital literacy, writing and creation, plagiarism detection, copyright management, productivity improvement, and equity and inclusion promotion. (“Tools Such as ChatGPT Threaten Transparent Science; Here Are Our Ground Rules for Their Use” 2023)

  • Develop a customised benchmark, named FaiRLLM, to assess the fairness of recommendation systems based on large language models (RecLLM), given the unique challenges posed by these systems. (Jizhi Zhang et al. 2023)

  • Carefully examine the types of problems for which code generation models tend to fail, and explore prompt engineering as a strategy for resolving errors, while considering the ethical implications and risks associated with the rapid increase in deployment of such models. (Denny, Kumar, and Giacaman 2022)

  • Consider the potential for adversarial attacks on transformer-based large language models (LLMs) through prompt injection, specifically focusing on goal hijacking and prompt leaking, and develop appropriate defense mechanisms accordingly. (Perez and Ribeiro 2022)

  • Consider employing a prompt-based adversarial attack strategy to effectively probe the vulnerabilities of pre-trained language models (PLMs) and subsequently enhance their robustness through a prompt-based adversarial training method. (Z. Yang et al. 2022)

  • Adopt the CheckList methodology for comprehensive behavioral testing of NLP models, which involves creating a matrix of general linguistic capabilities and test types to ensure thorough evaluation and identification of critical failures (a minimal sketch appears at the end of this list). (Ribeiro et al. 2020)

  • Carefully examine the limitations of current NLI systems in handling simple lexical inferences, and explore ways to enhance their generalization abilities through improved integration of lexical and world knowledge. (Glockner, Shwartz, and Goldberg 2018)

  • Develop and utilize new, system- and data-independent automatic evaluation methods for Natural Language Generation (NLG) systems, since current metrics like BLEU only weakly correlate with human judgments and are data- and system-specific. (B. Peng et al. 2017)

  • Utilize Wizard of Oz studies to understand the unique nature of man-machine interaction in natural language processing, as opposed to solely relying on human-human dialogue data. (NA?)

  • Consider developing a methodology to identify and categorize negative citations in order to gain deeper insights into the dynamics of scientific communication and collaboration. (NA?)

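As an illustration of the CheckList-style behavioral testing recommended above (Ribeiro et al. 2020), the sketch below runs a templated minimal-functionality test and an invariance test against a placeholder classifier. The classifier, templates, and perturbation are illustrative assumptions rather than part of the original paper; any real model's prediction function could be substituted.

```python
# Minimal sketch of CheckList-style behavioral tests: a minimal-functionality
# test (MFT) over templates with known labels, and an invariance test (INV)
# that checks predictions are stable under a label-preserving perturbation.

def predict_sentiment(text: str) -> str:
    # Placeholder model under test; swap in any real sentiment classifier.
    return "neg" if "not" in text.lower() else "pos"

def minimal_functionality_test(templates, fillers, expected):
    """MFT: templated inputs whose correct label is known by construction."""
    failures = []
    for template in templates:
        for word in fillers:
            text = template.format(word=word)
            if predict_sentiment(text) != expected:
                failures.append(text)
    return failures

def invariance_test(texts, perturb):
    """INV: the predicted label should not change under the perturbation."""
    return [t for t in texts if predict_sentiment(t) != predict_sentiment(perturb(t))]

# Capability: negation handling.
neg_failures = minimal_functionality_test(
    templates=["The food was not {word}."],
    fillers=["good", "great", "amazing"],
    expected="neg",
)

# Capability: robustness to an irrelevant suffix.
inv_failures = invariance_test(
    texts=["The service was excellent.", "I loved the ambience."],
    perturb=lambda t: t + " btw",
)

print(f"negation MFT failures: {len(neg_failures)}; invariance failures: {len(inv_failures)}")
```
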
Ethical Considerations

  • Focus on developing a nuanced understanding of the complex interplay between large generative AI models (LGAIMs), their developers, deployers, and users, and the associated ethical, legal, and societal implications, rather than solely focusing on the technical aspects of these models. (Hacker, Engel, and Mauer 2023)

  • Carefully examine the impact of assigning personas to language models, as doing so can significantly increase toxicity levels and perpetuate harmful stereotypes. (Deshpande et al. 2023)

  • Carefully select and validate your chosen benchmarks for measuring stereotype bias and discrimination in language models, and consider developing custom benchmarks tailored to your specific research goals. (Ganguli et al. 2023)

  • Carefully evaluate the potential legal and ethical risks associated with developing and deploying foundation models based on copyrighted content, and explore technical mitigations to ensure compliance with fair use principles. (Henderson et al. 2023)

  • Utilize retrieval-based methods to effectively detect AI-generated text, as opposed to relying solely on statistical properties or watermarking, which can be easily evaded through paraphrasing. (Krishna et al. 2023)

  • Consider the potential impact of stochastic parrots and hallucination in large language models like ChatGPT, which could lead to unverified information generation and subsequent ethical and legal challenges. (Shuai Li et al. 2023)

  • Focus on developing and testing multi-step jailbreaking prompts to effectively extract personally identifiable information (PII) from large language models (LLMs) like ChatGPT, despite their enhanced dialog safety features. (H. Li et al. 2023)

  • Utilize a sampling-based approach called “SelfCheckGPT” to detect hallucinations in generative large language models like GPT-3, which involves comparing multiple sampled responses from the model to measure information consistency and determine if statements are factual or hallucinated (a simplified sketch appears at the end of this list). (Manakul, Liusie, and Gales 2023)

  • Develop a test suite like XSTest to systematically identify exaggerated safety behaviors in large language models, which involves creating safe prompts that well-calibrated models should not refuse and unsafe prompts as contrasts that models should refuse. (Röttger et al. 2023)

  • Utilise soft-prompt tuning for bias evaluation of large language models, particularly for sentiment classification tasks, as it allows for fine-grained analysis and understanding of the model's bias towards under-represented groups, while reducing the risk of injecting human bias through manual prompt design. (Tian et al. 2023)

  • Focus on understanding the inherent limitations of alignment processes in large language models, particularly in regards to the potential for adversarial prompting attacks, and develop robust mechanisms to ensure AI safety. (Wolf et al. 2023)

  • Thoroughly explore and document the ethical challenges faced by large language models (LLMs) in real-world applications, focusing on aspects such as bias, reliability, robustness, and toxicity, and then propose strategies to mitigate these issues. (Zhuo et al. 2023)

  • Carefully evaluate the potential benefits and drawbacks of using ChatGPT as a language learning tool, considering factors such as its technical capabilities, pedagogical limitations, and content accuracy, while remaining aware of ethical concerns associated with AI usage. (Barrot 2023)

  • Avoid using large language models like ChatGPT as co-authors or incorporating AI-generated text into your submissions due to ethical concerns raised by several prominent academic journals. (Flanagin et al. 2023)

  • Acknowledge the use of AI chatbots in your studies, ensure transparency in your work, and collaborate with relevant stakeholders to develop clear ethical guidelines for integrating chatbots into scientific publications. (Ali and Djalilian 2023)

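To make the sampling-based consistency check behind SelfCheckGPT (Manakul, Liusie, and Gales 2023) concrete, the sketch below scores each sentence of a main response by its lexical overlap with several independently sampled responses. The original method uses BERTScore, NLI, or question answering for the comparison; the unigram-overlap scorer and the `sample_llm` stub here are simplifying assumptions.

```python
# Simplified SelfCheckGPT-style consistency scoring: sentences of the main
# response that are poorly supported by stochastic re-samples are flagged as
# likely hallucinations.
import re

def sample_llm(prompt: str, n: int) -> list[str]:
    # Assumption: returns n independently sampled responses (temperature > 0)
    # from whatever LLM backend is being audited.
    raise NotImplementedError

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def consistency_scores(main_response: str, samples: list[str]) -> list[tuple[str, float]]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", main_response) if s.strip()]
    scored = []
    for sent in sentences:
        toks = _tokens(sent)
        overlaps = [len(toks & _tokens(s)) / max(len(toks), 1) for s in samples]
        scored.append((sent, sum(overlaps) / max(len(overlaps), 1)))
    return scored

# Usage with a real backend: sentences scoring below a chosen threshold
# (e.g. 0.3) would be surfaced for manual verification or retrieval-based checks.
```
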
Prompt Engineering And Optimization

  • Carefully engineer your prompts to optimize communication with generative AI models, taking into consideration the model's capabilities and limitations, and utilizing advanced techniques such as chain-of-thought prompting and affordances to guide the model toward a desired outcome. (Amatriain 2024)

  • Consider using proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs, to achieve the result of directly tuning the model without accessing its internal weights, thereby enabling efficient customization of large pretrained LMs for diverse users and applications (a minimal sketch appears at the end of this list). (A. Liu et al. 2024)

  • Adopt a multi-phase approach to prompt engineering, involving multiple assessors and clear criteria, to improve the reliability, objectivity, and transparency of large language model outputs in scientific research. (C. Shah 2024)

  • Adopt the meta-prompting technique to improve the performance of language models by breaking down complex tasks into smaller subtasks, assigning them to specialized expert models, and coordinating their outputs through a central conductor model. (Suzgun and Kalai 2024)

  • Focus on optimising prompt engineering to elicit meaningful and accurate responses from AI language models by defining the objective, understanding the model's capabilities, being clear and concise, providing context and examples, fine-tuning and debugging prompts, specifying the format, including key details, testing and iterating, and considering safety and ethics. (Bozkurt and Sharma 2023)

  • Carefully engineer prompts for large language models to optimize their effectiveness, considering factors such as clarity, precision, role-playing, and the use of advanced techniques like chain-of-thought and tree-of-thoughts prompting. (Banghao Chen et al. 2023)

  • Actively select the most uncertain questions for annotation when developing chain-of-thought prompting strategies for large language models, as this leads to improved performance on complex reasoning tasks. (Diao et al. 2023)

  • Carefully engineer your prompts to ensure clarity, precision, relevance to learning objectives, stimulation of critical thinking, incorporation of practical applications, and provocation of reflection and self-assessment in order to optimize learning outcomes in medical and nursing education. (Heston 2023)

  • Employ Differentially-Private Offsite Prompt Tuning (DP-OPT) to create privacy-preserving prompts for cloud-hosted Large Language Models (LLMs), while maintaining data confidentiality, information privacy, and model ownership. (Hong et al. 2023)

  • Consider implementing a code-level self-prompt Zero-shot CoT (SelfzCoT) methodology for better utilization of large language models (LLMs) in multi-step reasoning tasks, as it significantly improves accuracy over existing state-of-the-art approaches. (Lei and Deng 2023)

  • Carefully consider the linguistic properties of prompts when working with large language models, as these properties can greatly impact model performance, and there is no clear correlation between performance and factors like perplexity, word frequency, ambiguity, or prompt length. (Leidinger, Rooij, and Shutova 2023)

  • Consider using a Large Language Model (LLM) as a generator in your experiments, as it allows for global optimization of prompts and ensures coherence in the generated texts. (Y. B. Li and Wu 2023)

  • Optimize prompt position in addition to focusing on prompt vocabulary selection and embedding initialization, as it significantly impacts model performance in natural language processing tasks. (J. Mao, Middleton, and Niranjan 2023)

  • Consider prompt engineering as an inverse problem, allowing them to automatically optimize prompts for large language models (LLMs) to achieve desired behavior and improve overall performance. (Melamed et al. 2023)

  • Focus on developing visual analytics systems like PromptAid to interactively create, refine, and test prompts through exploration, perturbation, testing, and iteration, thereby helping non-expert users to efficiently improve the performance of large language models. (Aditi Mishra et al. 2023)

  • Adopt a “declarative prompt engineering” approach to optimize the use of Large Language Models (LLMs) in data processing workflows, drawing on principles from the declarative crowdsourcing literature to achieve greater efficiency and accuracy. (Parameswaran et al. 2023)

  • Carefully examine the safety risks introduced by fine-tuning aligned large language models, as even seemingly innocuous adjustments can potentially compromise the safety alignment of these models. (Qi et al. 2023)

  • Consider using Synthetic prompting, a method that leverages a few handcrafted examples to prompt a large language model to generate more examples by itself, and selects effective demonstrations to elicit better reasoning, leading to improved performance on various reasoning tasks. (Shao et al. 2023)

  • Combine the pFlat metric with existing metrics like Mutual Information (MI) and Sensitivity (Sen) to improve the performance and sample efficiency of prompt selection for large language models. (L. Shen et al. 2023)

  • Focus on developing and testing automated methods for generating optimal prompts for large language models (LLMs) in order to improve their reasoning capabilities across various domains. (F. Shi et al. 2023)

  • Explore the use of ControlPE (Continuously Controllable Prompt Engineering) to enable finer adjustments to prompt effects, complementing existing prompt engineering, and effectively controlling continuous targets. (Y. Sun et al. 2023)

  • Consider adopting the “Self-Align” method for developing AI assistants, which uses a combination of principle-driven reasoning and the generative power of large language models to achieve self-alignment with minimal human supervision, thus improving efficiency, reducing bias, and increasing control. (Z. Sun et al. 2023)

  • Carefully design and optimize prompts for downstream tasks in order to maximize the performance of large language models in the medical domain. (Y.-J. Wang et al. 2023)

  • Consider applying the LLE-INC method for tuning-free manifold-based space re-embedding in your work, as it effectively preserves local properties within the same class as guidance for classification, leading to improved performance in prompt-based tuning. (H. Wang et al. 2023)

  • Include the purpose and target audience in prompts when using ChatGPT for translation tasks, as doing so can lead to higher quality translations that better match industry standards. (Yamada 2023)

  • Consider using the RRHF approach for aligning large language models with human preferences because it simplifies the training process, reduces the need for multiple models, and achieves comparable performance to PPO while requiring less hyperparameter tuning. (Z. Yuan et al. 2023)

  • Utilize the concept of Conversation Regression Testing to systematically evaluate and refine prompt strategies for chatbot development, enabling them to effectively address errors and ensure robustness and generalizability. (Zamfirescu-Pereira, Hartmann, and Yang 2023)

  • Develop comprehensive, reliable, and automated evaluation benchmarks for detecting and mitigating hallucination in large language models, considering the unique challenges posed by massive training data, versatility of LLMs, and imperceptibility of errors. (Y. Zhang et al. 2023)

  • Focus on developing effective training methodologies that can enhance model performance under limited data availability, particularly when dealing with complex, multi-word relation labels in relation classification tasks. (W. Zhang et al. 2023)

  • Carefully control the type and amount of evidence provided in the prompt when evaluating the effectiveness of ChatGPT in answering complex health information questions, as incorrect evidence can significantly reduce the model's accuracy. (Zuccon and Koopman 2023)

  • Consider aggregating the predictions of multiple effective, yet imperfect, prompts to improve prompting performance over a broad set of models and tasks. (Arora et al. 2022)

  • Utilize few-shot prompt learning to efficiently harness the capabilities of large language models for model completion tasks, thereby eliminating the need for extensive training or fine-tuning on large datasets. (Chaaben, Burgueño, and Sahraoui 2022)

  • Consider using prompt learning techniques for clinical decision tasks, as they can provide comparable or improved performance compared to traditional fine-tuning methods, while reducing computational resource costs and training data requirements. (Taylor et al. 2022)

  • Utilise Automatic Prompt Engineer (APE) for automatic instruction generation and selection, treating the instruction as the “program”, optimised by searching over a pool of instruction candidates proposed by an LLM in order to maximise a chosen score function, and evaluating the quality of the selected instruction through the zero-shot performance of another LLM following the selected instruction. (Y. Zhou et al. 2022)

  • Explore and analyze the vast zero-shot knowledge hidden within large language models (LLMs) before creating fine-tuning datasets or few-shot exemplars, as LLMs possess impressive zero-shot reasoning abilities demonstrated by the Zero-shot-CoT method. (Black et al. 2021)

  • Consider employing prompt engineering techniques to enhance the performance of existing AI for code models, rather than relying solely on fine-tuning or additional data acquisition. (Mark Chen et al. 2021)

  • Carefully consider the choice of pre-trained language models, prompt engineering techniques, answer engineering approaches, and multi-prompt learning strategies when implementing prompt-based learning methods in natural language processing. (P. Liu et al. 2021)

  • Carefully craft prompts for large language models to elicit desired emotional responses and improve the performance of chatbots in handling emotionally charged interactions. (NA?)

  • Consider utilizing prompt-based prototyping with large language models to reduce barriers of access, speed up the prototyping process, and improve communication among collaborators, while acknowledging the challenges associated with reverse engineering prompt designs, sourcing example data, debugging, and evaluating prompt effectiveness. (NA?)

  • Develop a systematic method to automatically align user intentions with the specific prompt preferences of each large language model (LLM) in natural language processing (NLP) applications, leading to enhanced performance across various downstream tasks. (NA?)

  • Carefully consider the choice of examples, token length, and ordering within the prompt when conducting prompt engineering for large language models like Codex, as these factors significantly impact the quality of generated code. (NA?)

  • Use automatic prompt engineering techniques to generate diverse natural language text, which can then be utilized to create optimal prompt templates for various tasks, thereby enabling large language models to effectively solve those tasks. (NA?)

  • Carefully select and engineer appropriate prompt templates and answer sets to enable accurate predictions from pre-trained language models across diverse natural language processing tasks. (NA?)

  • Optimize prompt engineering for natural language generation (NLG) output that has hermeneutic value for individual users, considering hermeneuticity to be subjectively determined by the reader and aiming for output that encourages critical reflection on personal assumptions and worldviews. (NA?)

  • Consider employing the “prompt-tuning” paradigm for pre-training language models (PLMs) in order to enhance their performance in medical text classification tasks. (NA?)

  • Consider employing prompt-based fine-tuning instead of standard fine-tuning for text classification tasks in low-resource languages like Urdu and Roman Urdu, as it significantly improves accuracy by up to 13% compared to traditional approaches. (NA?)

  • Consider integrating trusted knowledge sources into traditional language models to enhance their accuracy and reliability in addressing domain-specific queries. (NA?)

  • Focus on creating effective prompts for large language models (LLMs) using techniques such as chain-of-thought, few-shot learning, template usage, and prompt tuning to maximize accuracy, efficiency, and creativity in generating outputs across various domains. (NA?)

  • Focus on addressing the challenge of effectively utilizing pre-training knowledge in prompt learning for building foundation models when developing large-scale pre-trained and fine-tuned models. (NA?)

  • Consider implementing the proposed Soft Prompt Construction (SPC) framework to enhance cross-domain generalization capabilities in language models. (NA?)

  • Carefully consider the use of multi-turn dialogue prompts when working with GPT-3.5 for machine translation tasks, as it significantly improves the translation quality of the model. (NA?)

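A minimal numerical sketch of the proxy-tuning rule recommended above (A. Liu et al. 2024): at each decoding step the large base model's logits are shifted by the difference between a small tuned expert and its untuned anti-expert. The toy vocabulary and logit values are illustrative; the only assumption is that all three models expose next-token logits over a shared vocabulary.

```python
import numpy as np

def proxy_tuned_distribution(base_logits: np.ndarray,
                             expert_logits: np.ndarray,
                             antiexpert_logits: np.ndarray) -> np.ndarray:
    """softmax(s_base + s_expert - s_antiexpert) over the shared vocabulary."""
    shifted = base_logits + (expert_logits - antiexpert_logits)
    shifted = shifted - shifted.max()          # numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()

# Toy example over a 5-token vocabulary: the expert/anti-expert contrast
# boosts token 2 relative to the untuned base distribution.
base = np.array([2.0, 1.0, 0.5, 0.0, -1.0])
expert = np.array([0.0, 0.0, 2.0, 0.0, 0.0])
antiexpert = np.zeros(5)
print(proxy_tuned_distribution(base, expert, antiexpert))
```
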
Chatbots And Dialogue Systems

  • Consider incorporating large-scale language models (LLMs) in your dialogue systems, as demonstrated by the success of the team that utilized them in the competition, and focus on effectively using real-time information to enhance the performance of your systems. (Minato et al. 2024)

  • Carefully consider the choice between shared and separate contexts when using ChatGPT for software testing education, as shared context tends to yield more accurate answers and explanations. (Jalil et al. 2023)

  • Use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting, allowing them to generate high-quality conversational data without the need for extensive human annotation. (Maximillian Chen et al. 2023)

  • Continuously monitor the behavior of large language models (LLMs) like GPT-3.5 and GPT-4 over time, as their performance and behavior can vary significantly across different versions, potentially causing issues in integrating them into larger workflows and reproducing results. (L. Chen, Zaharia, and Zou 2023)

  • Consider using TikTok data to understand students' perspectives on ChatGPT, as it provides valuable insights into their interests and concerns, and offers a unique viewpoint compared to traditional survey methods. (Haensch et al. 2023)

  • Consider leveraging pre-existing audio foundation models instead of training multi-modal LLMs from scratch when developing systems for understanding and generating audio modality in spoken dialogues. (R. Huang et al. 2023)

  • Carefully consider the implications of personalization in large language models, balancing the benefits of increased user satisfaction and engagement against the potential risks of reinforcing individual biases, creating echo chambers, and compromising social cohesion. (Kirk et al. 2023)

  • Conduct a comprehensive evaluation of ChatGPT and similar large language models across multiple languages and tasks to understand their capabilities and limitations in multilingual NLP applications. (Lai et al. 2023)

  • Carefully consider the potential biases in GPT detectors against non-native English writers, and strive to develop more robust and equitable detection methods that take into account the linguistic nuances of non-native authors. (W. Liang et al. 2023)

  • Carefully consider the potential impact of linguistic ambiguity on natural language processing (NLP) systems, particularly in terms of lexical, syntactic, and semantic ambiguity, and develop appropriate methods to address these issues in order to enhance the accuracy and reliability of NLP applications. (Ortega-Martín et al. 2023)

  • Conduct formative user interviews to understand user perceptions and challenges associated with prompting large language models, and subsequently design interactive systems like PromptMind to streamline the iterative process of prompt exploration and refinement for improved chatbot responses. (G. Su, Yang, and Guo 2023)

  • Use large language models (LLMs) to recursively generate summaries as memory, allowing the LLM to efficiently update its knowledge base and generate more consistent responses in long-term conversations (a minimal sketch appears at the end of this list). (Q. Wang et al. 2023)

  • Consider the potential impact of ChatGPT on various industries, particularly in areas like scientific writing, education, and medicine, while addressing the associated challenges such as technical limitations, misuse, ethical concerns, and regulatory policies. (C. Zhang et al. 2023)

  • Carefully evaluate the reliability and accuracy of AI-generated content, such as ChatGPT, before integrating it into your academic writing, and consider implementing measures to maintain scientific rigor and transparency. (Alkaissi and McFarlane 2023)

  • Consider unifying the four tasks in multi-goal conversational recommender systems (MG-CRS) into the same sequence-to-sequence (Seq2Seq) paradigm, allowing for better integration and understanding of the complexities inherent in MG-CRS. (Y. Deng et al. 2022)

  • Carefully consider the interaction modalities, knowledge elements, and computational tasks involved in developing conversational recommender systems (CRS) to ensure effective and engaging user experiences. (Jannach et al. 2021)

  • Consider using an incremental graph parsing algorithm to dynamically infer social relations and individual attributes from dialogues, enabling accurate tracking of evolving social interactions and improved understanding of human language. (Hui Chen et al. 2020)

  • Carefully consider and address various types of annotation errors in dialogue state tracking tasks, including delayed markups, multi-annotations, mis-annotations, typos, forgotten values, and inconsistencies between slot values and ontology, through a combination of manual and automated corrections. (Eric et al. 2019)

  • Consider both Intellectual Quotient (IQ) and Emotional Quotient (EQ) while designing social chatbots, focusing on user engagement and defining the success metric as conversation-turns per session (CPS). (Shum, He, and Li 2018)

  • Focus on developing end-to-end models for negotiation tasks, utilizing techniques such as self-play and dialogue rollouts to optimize performance. (Bahdanau, Cho, and Bengio 2014)

  • Aim to develop a generic dialogue shell for practical dialogues, which are focused on accomplishing specific tasks, as opposed to attempting to replicate full human conversational competence. (ALLEN et al. 2000)

  • Carefully consider the choice of system prompts when modifying Large Language Models for specific tasks, such as acting as an AI Psychologist, to optimize their performance and suitability for the intended domain. (NA?)

  • Carefully consider the unique characteristics of each dialogue system class (task-oriented, conversational agents, and interactive question answering) when selecting and implementing evaluation methods, as these characteristics significantly affect the suitability and performance of different evaluation techniques. (NA?)

  • Employ a descriptive study design to compare the performance of ChatGPT with that of health sciences faculty students in answering anatomy course questions, using a multiple-choice test comprising 40 questions on the covered material. (NA?)

  • Exercise careful judgement and rigorous human oversight when using AI tools like ChatGPT in scientific writing, ensuring transparency about their use and avoiding reliance on them for core research tasks. (NA?)

  • Use a structured narrative prompt to ensure transparency, consistency, and traceability when transforming agent data into natural-sounding narratives, allowing for effective sentiment analysis and comparison with real tweets. (NA?)

  • Carefully consider the trustworthiness, value, and potential dangers of AI-generated health information, particularly when comparing it to traditional sources like Google, and acknowledge the current limitations of such systems, such as outdated data, lack of transparency, and occasional hallucinations. (NA?)

  • Carefully engineer prompts to maximize the accuracy and consistency of GPT-4's responses in medical applications, particularly for strong recommendations where the ROT style demonstrated the highest overall consistency. (NA?)

  • Carefully engineer your prompts to include context, define symbols, specify desired format and structure, provide background information, apply constraints and limitations, and iterate refinements to optimize the accuracy and reliability of ChatGPT's responses. (NA?)

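The recursive-summary memory idea cited above (Q. Wang et al. 2023) can be sketched as two prompts: one that answers the user given the current summary, and one that rewrites the summary after each exchange. The `llm` function below is a hypothetical stand-in for any chat-completion backend, and the prompt wording is an assumption rather than the paper's exact template.

```python
def llm(prompt: str) -> str:
    # Assumption: any text-completion / chat API could back this call.
    raise NotImplementedError

def update_memory(memory: str, user_msg: str, bot_msg: str) -> str:
    return llm(
        "Previous summary of the conversation:\n" + memory +
        "\n\nNew exchange:\nUser: " + user_msg + "\nAssistant: " + bot_msg +
        "\n\nRewrite the summary so it stays short but keeps every fact needed later."
    )

def respond(memory: str, user_msg: str) -> tuple[str, str]:
    bot_msg = llm(
        "Summary of the conversation so far:\n" + memory +
        "\n\nUser: " + user_msg + "\nAssistant:"
    )
    return bot_msg, update_memory(memory, user_msg, bot_msg)

# Usage: memory = ""; then repeatedly `reply, memory = respond(memory, turn)`.
```
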
Named Entity Recognition And Disambiguation

  • Consider using a template-free approach called Entity-oriented Language Model (EntLM) fine-tuning for few-shot Named Entity Recognition (NER) tasks, as it offers improved efficiency and accuracy compared to traditional template-based methods. (R. Ma et al. 2021)

  • Focus on developing a comprehensive understanding of the complex interplay between context features, mentions, entities, and knowledge graphs in order to effectively solve the named entity linking problem. (W. Shi et al. 2020)

  • Leverage free open data sources like DBpedia and Wikipedia to automatically generate labeled datasets for Named Entity Recognition (NER) tasks, thereby reducing the need for expensive human-annotated datasets. (Menezes, Savarese, and Milidiú 2019)

  • Carefully consider how to effectively link events and locations within text data for accurate analysis and interpretation. (Halterman 2019)

  • Use optimal transport theory to dually optimize entity-level and group-level losses in cross-lingual entity alignment, improving alignment accuracy. (Pei, Yu, and Zhang 2019)

  • Prioritize using full-text articles rather than abstracts alone for text mining tasks, as doing so leads to improved accuracy and performance in identifying biologically relevant associations. (Westergaard et al. 2017)

  • Utilize multiple techniques from machine learning and natural language processing to develop effective named entity recognition (NER) systems for accurately identifying biological entities in text, taking into account the challenges posed by ambiguous terminology, inconsistencies in nomenclature, and complex multi-word names. (Leser and Hakenberg 2005)

  • Focus on developing a robust semantic interpretation framework for accurately identifying the correct senses of complex domain terms and their relationships, which is crucial for improving ontology development, document retrieval, and multilingual communication. (NA?)

  • Utilize a combination of methods, including dictionary generation, occurrence detection, and filtering of matches, to accurately identify and distinguish between protein and gene names within biomedical texts (a minimal sketch appears at the end of this list). (NA?)

  • Focus on developing a comprehensive feature set that effectively represents the task at hand, combining both basic orthographic and character-based predicates with domain-specific expert knowledge, such as gene and protein lexicons, to enhance the overall performance of the conditional random field (CRF) model. (NA?)

  • Focus on creating a high-quality, manually annotated text corpus for chemical entity recognition, ensuring that it covers diverse chemical disciplines and follows strict annotation guidelines to improve the accuracy and consistency of chemical entity identification. (NA?)

  • Consider implementing a joint machine learning model for simultaneous named entity recognition (NER) and normalization during both training and prediction phases, as it leads to improved performance compared to traditional sequential pipelines. (NA?)

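As a concrete, if toy, version of the dictionary-based pipeline for gene and protein mentions noted above, the sketch below scans text with longest-match-first gazetteer lookup plus a small stop list for ambiguous strings. The gazetteer, stop list, and normalization are placeholder assumptions; a real system would derive them from curated resources and add statistical filtering.

```python
GAZETTEER = {"tp53", "p53", "brca1", "insulin receptor"}
STOP = {"was", "can", "has"}        # short/ambiguous strings to discard

def find_mentions(text: str, max_len: int = 4):
    words = text.split()
    spans, i = [], 0
    while i < len(words):
        match = None
        # Longest match first: try spans of decreasing length starting at i.
        for j in range(min(len(words), i + max_len), i, -1):
            candidate = " ".join(words[i:j]).lower().strip(".,;:()")
            if candidate in GAZETTEER and candidate not in STOP:
                match = (i, j, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

print(find_mentions("Mutations in TP53 and the insulin receptor were studied."))
# -> [(2, 3, 'tp53'), (5, 7, 'insulin receptor')]
```
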
Word Embeddings And Sense Disambiguation

  • Consider combining low-level semantic processing tasks like word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection with high-level natural language processing tasks to create innovative solutions and advance the field of computational linguistics. (R. Mao et al. 2023)

  • Utilize a unified evaluation framework for Word Sense Disambiguation tasks, which involves standardizing datasets and training corpora into a uniform format, semi-automatically converting annotations to WordNet 3.0, and applying consistent preprocessing pipelines. (Park, Shin, and Lee 2022)

  • Leverage the power of pre-trained language models like BERT to improve the accuracy of ontology subsumption predictions, particularly when dealing with complex ontologies expressed in languages like OWL. (J. Chen et al. 2022)

  • Move beyond treating words as discrete entities and instead represent them as vectors, enabling better understanding of semantic relationships between words and improved performance in natural language processing tasks (a minimal sketch appears at the end of this list). (Smith 2020)

  • Incorporate weak-supervision directly at the word sense level, instead of operating solely at the word form level, to improve lexical understanding in natural language processing tasks. (Levine et al. 2019)

  • Consider using a combination of manual, semi-automatic, automatic, and collaborative methods to create sense-annotated corpora for various languages and resources, such as WordNet, Wikipedia, and BabelNet, in order to improve the quality and quantity of available data for research and evaluation purposes. (Pasini and Camacho-Collados 2018)

  • Leverage BabelNet, a multilingual lexicalized semantic network, to create a large-scale high-quality corpus of sense-annotated textual definitions by combining definitions from different resources and languages, and refining the disambiguation output with a distributional approach based on semantic similarity. (Camacho-Collados et al. 2018)

  • Utilize a comprehensive framework to understand and address predictive biases in NLP systems, including recognizing four major sources of bias: label bias, selection bias, model overamplification, and semantic bias. (Blodgett and O’Connor 2017)

  • Carefully control for the choice of pre-trained word embeddings and the handling of out-of-vocabulary tokens at test time when comparing different architectures for reading comprehension tasks, as these factors can have a greater impact on performance than architectural choices. (Dhingra et al. 2017)

  • Employ a knowledge-based +/-effect coarse-grained sense disambiguation method based on selectional preferences modeled via topic models to accurately analyze implicit sentiment in text. (Pang, Lee, and Vaithyanathan 2002)

  • Recognize that word senses are not fixed entities, but rather depend on the specific purpose and context of the task at hand. (Kilgarriff 1997)

  • Focus on identifying architectural explanations for the parser's observed structural preferences, which can lead to deeper understanding of the parsing machinery and its design. (NA?)

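To illustrate the shift from discrete word symbols to vectors (Smith 2020), the sketch below measures semantic similarity with cosine distance over a tiny, made-up embedding table; a real study would load pretrained vectors such as word2vec, GloVe, or fastText instead.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real vectors have hundreds of dimensions.
EMB = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word: str) -> str:
    return max((w for w in EMB if w != word), key=lambda w: cosine(EMB[word], EMB[w]))

print(round(cosine(EMB["king"], EMB["queen"]), 3))   # high: related words
print(round(cosine(EMB["king"], EMB["apple"]), 3))   # low: unrelated words
print(nearest("king"))                               # -> "queen" in this toy space
```
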
Advances In Artificial Intelligence And Nlp

  • Integrate AI technologies, specifically ChatGPT, into your studies to enhance students' learning effectiveness, distribute educational resources more evenly, and improve the overall quality of education. (Dempere et al. 2023)

  • Utilise neural machine translation to convert internal state-action representations of an autonomous agent into natural language, allowing for more accurate and human-friendly explanations of the agent's behaviour. (Ehsan et al. 2017)

Deep Learning Advances

  • Identify a feature X such that (i) large language models have X, and (ii) if a system has X, then it is probably conscious, while providing good reasons for (i) and (ii) to explore the possibility of consciousness in AI systems. (Chalmers 2023)

  • Carefully construct false-belief tasks and include true-belief controls to ensure accurate evaluation of large language models' ability to infer unobservable mental states. (Kosinski 2023)

  • Focus on developing algorithms that enforce local typicality in language generation, as this approach leads to higher-quality text with fewer degenerate repetitions (a minimal sampling sketch appears at the end of this list). (Breiman 1957)

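The locally typical decoding idea referenced above can be sketched as follows: at each step, keep the tokens whose surprisal is closest to the entropy of the next-token distribution until a probability mass tau is covered, then renormalize and sample. The toy distribution below is illustrative.

```python
import numpy as np

def typical_sample(probs: np.ndarray, tau: float = 0.95, rng=None) -> int:
    rng = rng or np.random.default_rng()
    surprisal = -np.log(probs + 1e-12)
    entropy = float(np.sum(probs * surprisal))
    # Rank tokens by how close their surprisal is to the conditional entropy.
    order = np.argsort(np.abs(surprisal - entropy))
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, tau)) + 1      # smallest set covering tau
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))

# Toy next-token distribution over a 5-token vocabulary.
p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(typical_sample(p, tau=0.9))
```
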
Advancements In Machine Translation

  • Consider employing “pivot prompting” - a novel approach whereby ChatGPT is asked to translate the source sentence into a high-resource pivot language (such as English) before translating it into the target language. This method was found to significantly enhance translation performance for distant languages, making it a valuable tool for future research in machine translation (a minimal sketch appears at the end of this list). (Jiao et al. 2023)

  • Consider using large language models like GPT-3.5 and above for translation quality assessment, as they achieve state-of-the-art accuracy in comparison to human labels. (Kocmi and Federmann 2023)

  • Adopt prompt-based fine-tuning with informative evidence to improve the performance of critical error detection (CED) in English-Korean translation. (Bérard, Calapodescu, and Roux 2019)

  • Carefully assess the similarity of term-document matrices (TDMs) and topic model outputs derived from gold standard and machine-translated texts to ensure minimal loss of information during cross-language comparisons. (Vries, Schoonvelde, and Schumacher 2018)

  • Consider utilizing synchronous tree substitution grammar (STSG) for learning non-isomorphic tree mappings in machine translation tasks, as it permits local distortion of tree topology and can be extended to train on pairs of forests, allowing for greater flexibility in handling complex language structures. (Eisner 2003)

  • Consider adopting a joint source-channel model for machine transliteration tasks, as it enables direct orthographical mapping (DOM) between two different languages, leading to improved transliteration accuracy compared to traditional methods. (NA?)

  • Consider utilizing the Bible as a massive parallel corpus for natural language processing tasks, particularly for low-resource languages, due to its wide range of translations and unique identification of verses allowing for automatic, unambiguous alignment across languages. (NA?)

  • Consider utilising a combination of deep reinforcement learning and explicit lexical simplification techniques within an encoder-decoder model to optimise the quality of sentence simplification results. (NA?)

  • Use direct assessments (DA) instead of relative ranking (RR) when evaluating machine translation quality, as DA correlates strongly with RR and offers advantages like evaluating absolute translation quality and enabling quality-controlled crowd-sourcing. (NA?)

  • Utilize Multidimensional Quality Metrics (MQM) for more accurate and reliable quality assessment of machine translation outputs, particularly when dealing with low-resource language pairs. (NA?)

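A minimal sketch of the pivot-prompting recipe above (Jiao et al. 2023), assuming a hypothetical `chat(prompt) -> str` wrapper around any chat LLM: the source sentence is first translated into a high-resource pivot language and the pivot output is then translated into the target language.

```python
def chat(prompt: str) -> str:
    # Assumption: wraps whichever chat LLM API is available.
    raise NotImplementedError

def pivot_translate(text: str, src: str, tgt: str, pivot: str = "English") -> str:
    pivot_text = chat(f"Translate the following {src} sentence into {pivot}:\n{text}")
    return chat(f"Translate the following {pivot} sentence into {tgt}:\n{pivot_text}")

# Usage: pivot_translate("...", src="Japanese", tgt="Ukrainian")
```
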
Multi-Modal Processing

  • Consider integrating a small audio encoder into large language models (LLMs) to enhance their speech recognition capabilities, potentially achieving superior performance compared to monolingual baselines (a minimal sketch appears at the end of this list). (Fathullah et al. 2023)

  • Consider imposing syntactic constraints on paraphrases extracted from parallel corpora to enhance your quality and maintain grammaticality. (NA?)

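One way to read the audio-encoder recommendation above (Fathullah et al. 2023) is as a small module that downsamples audio frames, projects them to the LLM's embedding width, and prepends them to the text token embeddings. The sketch below is a speculative PyTorch illustration with made-up dimensions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioPrefixEncoder(nn.Module):
    def __init__(self, d_audio: int = 80, d_model: int = 2048, stride: int = 4):
        super().__init__()
        # Strided convolution shortens the audio sequence before projection.
        self.downsample = nn.Conv1d(d_audio, d_model, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, d_audio) -> (batch, frames // stride, d_model)
        x = self.downsample(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(torch.relu(x))

encoder = AudioPrefixEncoder()
audio = torch.randn(1, 400, 80)          # ~4 seconds of 80-dim filterbank frames
text_embeds = torch.randn(1, 16, 2048)   # embeddings of the text prompt tokens
# Audio embeddings are prepended to the token embeddings and fed to the LLM.
inputs = torch.cat([encoder(audio), text_embeds], dim=1)
print(inputs.shape)                      # torch.Size([1, 116, 2048])
```
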
Resources And Datasets

  • Consider creating and maintaining a massive corpus for low-resource languages, such as Ukrainian, to provide a strong foundation for natural language processing tasks, enabling the development of contemporary language models and word embeddings, ultimately improving the performance of numerous downstream tasks. (Chaplynskyi 2023)

  • Utilise the NusaCrowd platform to access and leverage its extensive range of standardised Indonesian language datasets, thereby facilitating improved performance in Natural Language Processing tasks. (Altaher et al. 2022)

  • Leverage the unique characteristics of news article revision histories, specifically their ability to reflect updates to rapidly changing events, to improve existing NLP tasks and explore new ones. (Spangher and May 2021)

  • Utilize a balanced corpus, specifically the “Balanced Corpus of Contemporary Written Japanese” (BCCWJ), to ensure accurate representation and diversity in your studies of the Japanese language. (NA?)

  • Utilize large eyetracking corpora of natural reading to better understand and evaluate language models that go beyond the word level, enabling examination of numerous variables at various processing levels and their interactions, ultimately improving the generalizability of findings. (NA?)

Publicly Available Datasets

References

n.d. https://doi.org/10.1371/journal.pmed.0050201.t001.
Abukhalaf, Seif, Mohammad Hamdaqa, and Foutse Khomh. 2023. “On Codex Prompt Engineering for OCL Generation: An Empirical Study.” arXiv. https://doi.org/10.48550/ARXIV.2303.16244.
Adak, Sayantan, Altaf Ahmad, Aditya Basu, and Animesh Mukherjee. 2022. “Placing (Historical) Facts on a Timeline: A Classification Cum Coref Resolution Approach.” arXiv. https://doi.org/10.48550/ARXIV.2206.14089.
Addlesee, Angus, Weronika Sieińska, Nancie Gunson, Daniel Hernández Garcia, Christian Dondrup, and Oliver Lemon. 2023. “Multi-Party Goal Tracking with LLMs: Comparing Pre-Training, Fine-Tuning, and Prompt Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2308.15231.
Agirre, Eneko, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. “SemEval-2014 Task 10: Multilingual Semantic Textual Similarity.” Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). https://doi.org/10.3115/v1/s14-2010.
Agrawal, Sweta, and Marine Carpuat. 2022. “An Imitation Learning Curriculum for Text Editing with Non-Autoregressive Models.” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.520.
Ahmed, Toufique, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. “Recommending Root-Cause and Mitigation Steps for Cloud Incidents Using Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2301.03797.
Akyürek, Afra Feyza, Ekin Akyürek, Leshem Choshen, Derry Wijaya, and Jacob Andreas. 2024. “Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability.” arXiv. https://doi.org/10.48550/ARXIV.2401.08574.
Akyürek, Afra Feyza, Sejin Paik, Muhammed Yusuf Kocyigit, Seda Akbiyik, Şerife Leman Runyun, and Derry Wijaya. 2022. “On Measuring Social Biases in Prompt-Based Multi-Task Learning.” arXiv. https://doi.org/10.48550/ARXIV.2205.11605.
Ali, Mohammad Javed, and Ali Djalilian. 2023. “Readership Awareness Series – Paper 4: Chatbots and ChatGPT - Ethical Considerations in Scientific Publications.” Seminars in Ophthalmology 38 (March). https://doi.org/10.1080/08820538.2023.2193444.
Alkaissi, Hussam, and Samy I McFarlane. 2023. “Artificial Hallucinations in ChatGPT: Implications in Scientific Writing.” Cureus, February. https://doi.org/10.7759/cureus.35179.
ALLEN, JAMES, DONNA BYRON, MYROSLAVA DZIKOVSKA, GEORGE FERGUSON, LUCIAN GALESCU, and AMANDA STENT. 2000. “An Architecture for a Generic Dialogue Shell.” Natural Language Engineering 6 (September). https://doi.org/10.1017/s135132490000245x.
Alt, Christoph, Marc Hübner, and Leonhard Hennig. 2019. “Improving Relation Extraction by Pre-Trained Language Representations.” arXiv. https://doi.org/10.48550/ARXIV.1906.03088.
Altaher, Yousef, Ali Fadel, Mazen Alotaibi, Mazen Alyazidi, Mishari Al-Mutairi, Mutlaq Aldhbuiub, Abdulrahman Mosaibah, et al. 2022. “Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets,” August. http://arxiv.org/abs/2208.00932v1.
Althoff, Tim, Kevin Clark, and Jure Leskovec. 2016. “Large-Scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health.” Transactions of the Association for Computational Linguistics 4 (December). https://doi.org/10.1162/tacl_a_00111.
Alturayeif, Nora, Hamzah Luqman, and Moataz Ahmed. 2023. “A Systematic Review of Machine Learning Techniques for Stance Detection and Its Applications.” Neural Computing and Applications 35 (January). https://doi.org/10.1007/s00521-023-08285-7.
Amatriain, Xavier. 2024. “Prompt Design and Engineering: Introduction and Advanced Methods.” arXiv. https://doi.org/10.48550/ARXIV.2401.14423.
Amin, Mostafa M., Erik Cambria, and Björn W. Schuller. 2023. “Will Affective Computing Emerge from Foundation Models and General AI? A First Evaluation on ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2303.03186.
Amodei, Dario, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, et al. 2015. “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.” arXiv. https://doi.org/10.48550/ARXIV.1512.02595.
Anand, Yuvanesh, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Ben Schmidt, GPT4All Community, Brandon Duderstadt, and Andriy Mulyar. 2023. “GPT4All: An Ecosystem of Open Source Compressed Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2311.04931.
Angelidis, Stefanos, and Mirella Lapata. 2018. “Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised.” arXiv. https://doi.org/10.48550/ARXIV.1808.08858.
Aowal, Md Abdul, Maliha T Islam, Priyanka Mary Mammen, and Sandesh Shetty. 2023. “Detecting Natural Language Biases with Prompt-Based Learning.” arXiv. https://doi.org/10.48550/ARXIV.2309.05227.
Arif, Taha Bin, Uzair Munaf, and Ibtehaj Ul-Haque. 2023. “The Future of Medical Education and Research: Is ChatGPT a Blessing or Blight in Disguise?” Medical Education Online 28 (February). https://doi.org/10.1080/10872981.2023.2181052.
Arndt, Natanael, Kurt Junghanns, Claus Stadler, and Felix Brei. 2023. “AKSW/LLM-KG-Bench: 1.1.0,” September. https://doi.org/10.5281/ZENODO.8366061.
Arora, Simran, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. “Ask Me Anything: A Simple Strategy for Prompting Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2210.02441.
Arsikere, Harish, Ashtosh Sapru, and Sri Garimella. 2019. “Multi-Dialect Acoustic Modeling Using Phone Mapping and Online i-Vectors.” Interspeech 2019, September. https://doi.org/10.21437/interspeech.2019-2881.
Asare, Owura, Meiyappan Nagappan, and N. Asokan. 2022. “Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code?” arXiv. https://doi.org/10.48550/ARXIV.2204.04741.
Baek, Jinheon, Nirupama Chandrasekaran, Silviu Cucerzan, Allen herring, and Sujay Kumar Jauhar. 2023. “Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion.” arXiv. https://doi.org/10.48550/ARXIV.2311.06318.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate,” September. http://arxiv.org/abs/1409.0473v7.
Bajaj, Payal, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, et al. 2016. “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” arXiv. https://doi.org/10.48550/ARXIV.1611.09268.
Balahur, Alexandra, Ralf Steinberger, Mijail Kabadjov, Vanni Zavarella, Erik van der Goot, Matina Halkia, Bruno Pouliquen, and Jenya Belyaeva. 2013. “Sentiment Analysis in the News.” arXiv. https://doi.org/10.48550/ARXIV.1309.6202.
Bao, Keqin, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. “TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation.” Proceedings of the 17th ACM Conference on Recommender Systems, September. https://doi.org/10.1145/3604915.3608857.
Barrot, Jessie S. 2023. “Using ChatGPT for Second Language Writing: Pitfalls and Potentials.” Assessing Writing 57 (July). https://doi.org/10.1016/j.asw.2023.100745.
Beieler, John. 2016. “Generating Politically-Relevant Event Data.” arXiv. https://doi.org/10.48550/ARXIV.1609.06239.
Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. 2013. “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” August. http://arxiv.org/abs/1308.3432v1.
Bérard, Alexandre, Ioan Calapodescu, and Claude Roux. 2019. “Naver Labs Europe’s Systems for the WMT19 Machine Translation Robustness Task.” arXiv. https://doi.org/10.48550/ARXIV.1907.06488.
Beurer-Kellner, Luca, Marc Fischer, and Martin Vechev. 2023. “Prompting Is Programming: A Query Language for Large Language Models.” Proceedings of the ACM on Programming Languages 7 (June). https://doi.org/10.1145/3591300.
Bhagavatula, Chandra, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. 2019. “Abductive Commonsense Reasoning,” August. http://arxiv.org/abs/1908.05739v2.
Bi, Bin, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2020. “PALM: Pre-Training an Autoencoding&autoregressive Language Model for Context-Conditioned Generation.” arXiv. https://doi.org/10.48550/ARXIV.2004.07159.
Biderman, Stella, Kieran Bicheno, and Leo Gao. 2022. “Datasheet for the Pile.” arXiv. https://doi.org/10.48550/ARXIV.2201.07311.
Biderman, Stella, and Edward Raff. 2022. “Fooling MOSS Detection with Pretrained Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2201.07406.
Biswas, Som. 2023. “Role of ChatGPT in Computer Programming.” Mesopotamian Journal of Computer Science, January. https://doi.org/10.58496/mjcsc/2023/002.
Black, Sid, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. 2021. “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” March. https://doi.org/10.5281/ZENODO.5297715.
Blodgett, Su Lin, Lisa Green, and Brendan O’Connor. 2016. “Demographic Dialectal Variation in Social Media: A Case Study of African-American English.” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d16-1120.
Blodgett, Su Lin, and Brendan O’Connor. 2017. “Racial Disparity in Natural Language Processing: A Case Study of Social Media African-American English.” arXiv. https://doi.org/10.48550/ARXIV.1707.00061.
Bolotova, Valeria, Vladislav Blinov, Yukun Zheng, W. Bruce Croft, Falk Scholer, and Mark Sanderson. 2020. “Do People and Neural Nets Pay Attention to the Same Words.” Proceedings of the 29th ACM International Conference on Information & Knowledge Management, October. https://doi.org/10.1145/3340531.3412043.
Bommarito, Michael J, Daniel Martin Katz, and Eric M Detterman. 2018. “LexNLP: Natural Language Processing and Information Extraction for Legal and Regulatory Texts.” arXiv. https://doi.org/10.48550/ARXIV.1806.03688.
Bordes, Antoine, Y-Lan Boureau, and Jason Weston. 2016. “Learning End-to-End Goal-Oriented Dialog.” arXiv. https://doi.org/10.48550/ARXIV.1605.07683.
Bordt, Sebastian, and Ulrike von Luxburg. 2023. “ChatGPT Participates in a Computer Science Exam.” arXiv. https://doi.org/10.48550/ARXIV.2303.09461.
Borji, Ali. 2023. “A Categorical Archive of ChatGPT Failures.” arXiv. https://doi.org/10.48550/ARXIV.2302.03494.
Bowman, Samuel R., Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. “A Large Annotated Corpus for Learning Natural Language Inference.” arXiv. https://doi.org/10.48550/ARXIV.1508.05326.
Bozkurt, Aras, and Ramesh C. Sharma. 2023. “Generative AI and Prompt Engineering: The Art of Whispering to Let the Genie Out of the Algorithmic World,” July. https://doi.org/10.5281/ZENODO.8174941.
Breiman, Leo. 1957. “The Individual Ergodic Theorem of Information Theory.” The Annals of Mathematical Statistics 28 (September). https://doi.org/10.1214/aoms/1177706899.
Brunner, Gino, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2019. “On Identifiability in Transformers.” arXiv. https://doi.org/10.48550/ARXIV.1908.04211.
Brysbaert, Marc, and Boris New. 2009. “Moving Beyond Kučera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English.” Behavior Research Methods 41 (November). https://doi.org/10.3758/brm.41.4.977.
Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, et al. 2023. “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.” arXiv. https://doi.org/10.48550/ARXIV.2303.12712.
Bui, Nghi D. Q., Yijun Yu, and Lingxiao Jiang. 2019. “SAR: Learning Cross-Language API Mappings with Little Knowledge.” Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August. https://doi.org/10.1145/3338906.3338924.
Burnap, Pete, and Matthew L. Williams. 2015. “Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making.” Policy & Internet 7 (April). https://doi.org/10.1002/poi3.85.
Calvo, Manuel G., and Enrique Meseguer. 2002. “Eye Movements and Processing Stages in Reading: Relative Contribution of Visual, Lexical, and Contextual Factors.” The Spanish Journal of Psychology 5 (May). https://doi.org/10.1017/s1138741600005849.
Camacho-Collados, Jose, Claudio Delli Bovi, Alessandro Raganato, and Roberto Navigli. 2018. “SenseDefs: A Multilingual Corpus of Semantically Annotated Textual Definitions.” Language Resources and Evaluation 53 (July). https://doi.org/10.1007/s10579-018-9421-3.
Cao, Boxi, Hongyu Lin, Xianpei Han, and Le Sun. 2023. “The Life Cycle of Knowledge in Big Language Models: A Survey.” arXiv. https://doi.org/10.48550/ARXIV.2303.07616.
Cascella, Marco, Jonathan Montomoli, Valentina Bellini, and Elena Bignami. 2023. “Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios.” Journal of Medical Systems 47 (March). https://doi.org/10.1007/s10916-023-01925-4.
Cetto, Matthias, Christina Niklaus, André Freitas, and Siegfried Handschuh. 2018. “Graphene: A Context-Preserving Open Information Extraction System.” arXiv. https://doi.org/10.48550/ARXIV.1808.09463.
Chaaben, Meriem Ben, Lola Burgueño, and Houari Sahraoui. 2022. “Towards Using Few-Shot Prompt Learning for Automating Model Completion.” arXiv. https://doi.org/10.48550/ARXIV.2212.03404.
Chalmers, David J. 2023. “Could a Large Language Model Be Conscious?” arXiv. https://doi.org/10.48550/ARXIV.2303.07103.
Chan, Chunkit, Jiayang Cheng, Weiqi Wang, Yuxin Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. 2023. “ChatGPT Evaluation on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations.” arXiv. https://doi.org/10.48550/ARXIV.2304.14827.
Chan, William, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2015. “Listen, Attend and Spell.” arXiv. https://doi.org/10.48550/ARXIV.1508.01211.
Chang, Kent K., Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.” arXiv. https://doi.org/10.48550/ARXIV.2305.00118.
Chaplynskyi, Dmytro. 2023. “Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale.” Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). https://doi.org/10.18653/v1/2023.unlp-1.1.
Chen, Banghao, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. 2023. “Unleashing the Potential of Prompt Engineering in Large Language Models: A Comprehensive Review.” arXiv. https://doi.org/10.48550/ARXIV.2310.14735.
Chen, Bo, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, et al. 2024. “xTrimoPGLM: Unified 100B-Scale Pre-Trained Transformer for Deciphering the Language of Protein.” arXiv. https://doi.org/10.48550/ARXIV.2401.06199.
Chen, Danqi, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. “Reading Wikipedia to Answer Open-Domain Questions.” arXiv. https://doi.org/10.48550/ARXIV.1704.00051.
Chen, Hailin, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, and Shafiq Joty. 2023. “ChatGPT’s One-Year Anniversary: Are Open-Source Large Language Models Catching Up?” arXiv. https://doi.org/10.48550/ARXIV.2311.16989.
Chen, Hui, Pengfei Hong, Wei Han, Navonil Majumder, and Soujanya Poria. 2020. “Dialogue Relation Extraction with Document-Level Heterogeneous Graph Attention Networks.” arXiv. https://doi.org/10.48550/ARXIV.2009.05092.
Chen, Jiaoyan, Yuan He, Yuxia Geng, Ernesto Jimenez-Ruiz, Hang Dong, and Ian Horrocks. 2022. “Contextual Semantic Embeddings for Ontology Subsumption Prediction.” arXiv. https://doi.org/10.48550/ARXIV.2202.09791.
Chen, Lingjiao, Matei Zaharia, and James Zou. 2023. “How Is ChatGPT’s Behavior Changing over Time?” arXiv. https://doi.org/10.48550/ARXIV.2307.09009.
Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. 2021. “Evaluating Large Language Models Trained on Code.” arXiv. https://doi.org/10.48550/ARXIV.2107.03374.
Chen, Maximillian, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, and Dilek Hakkani-Tur. 2023. “PLACES: Prompting Language Models for Social Conversation Synthesis.” arXiv. https://doi.org/10.48550/ARXIV.2302.03269.
Chen, Shiqi, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023. “FELM: Benchmarking Factuality Evaluation of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2310.00741.
Chen, Weixin, Dawn Song, and Bo Li. 2024. “GRATH: Gradual Self-Truthifying for Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.12292.
Chen, Yi, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. 2023. “Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study.” arXiv. https://doi.org/10.48550/ARXIV.2304.00723.
Cheng, Daixuan, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. 2023. “UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation.” arXiv. https://doi.org/10.48550/ARXIV.2303.08518.
Cheng, Jiale, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. 2023. “Black-Box Prompt Optimization: Aligning Large Language Models Without Model Training.” arXiv. https://doi.org/10.48550/ARXIV.2311.04155.
Cheng, Yu, Jieshan Chen, Qing Huang, Zhenchang Xing, Xiwei Xu, and Qinghua Lu. 2023. “Prompt Sapper: A LLM-Empowered Production Tool for Building AI Chains.” arXiv. https://doi.org/10.48550/ARXIV.2306.12028.
Chiang, Cheng-Han, and Hung-yi Lee. 2023. “Can Large Language Models Be an Alternative to Human Evaluations?” arXiv. https://doi.org/10.48550/ARXIV.2305.01937.
———. 2024. “Over-Reasoning and Redundant Calculation of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.11467.
Clavié, Benjamin, Alexandru Ciceu, Frederick Naylor, Guillaume Soulié, and Thomas Brightwell. 2023. “Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification.” arXiv. https://doi.org/10.48550/ARXIV.2303.07142.
“Compiling and Analysing the Spoken British National Corpus 2014.” 2017. International Journal of Corpus Linguistics 22 (November). https://doi.org/10.1075/ijcl.22.3.
Cotton, Debby R. E., Peter A. Cotton, and J. Reuben Shipway. 2023. “Chatting and Cheating: Ensuring Academic Integrity in the Era of ChatGPT.” Innovations in Education and Teaching International, March. https://doi.org/10.1080/14703297.2023.2190148.
Crispino, Nicholas, Kyle Montgomery, Fankun Zeng, Dawn Song, and Chenguang Wang. 2023. “Agent Instructs Large Language Models to Be General Zero-Shot Reasoners.” arXiv. https://doi.org/10.48550/ARXIV.2310.03710.
Dai, Haixing, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, et al. 2023. “AugGPT: Leveraging ChatGPT for Text Data Augmentation.” arXiv. https://doi.org/10.48550/ARXIV.2302.13007.
Davis, Christopher, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, and Paula Buttery. 2024. “Prompting Open-Source and Commercial Language Models for Grammatical Error Correction of English Learner Text.” arXiv. https://doi.org/10.48550/ARXIV.2401.07702.
Dempere, Juan, Kennedy Modugu, Allam Hesham, and Lakshmana Kumar Ramasamy. 2023. “The Impact of ChatGPT on Higher Education.” Frontiers in Education 8 (September). https://doi.org/10.3389/feduc.2023.1206936.
Deng, Lingjia, and Janyce Wiebe. 2014. “Sentiment Propagation via Implicature Constraints.” Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. https://doi.org/10.3115/v1/e14-1040.
Deng, Xiang, Yu Su, Alyssa Lees, You Wu, Cong Yu, and Huan Sun. 2021. “ReasonBERT: Pre-Trained to Reason with Distant Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2109.04912.
Deng, Yang, Wenxuan Zhang, Weiwen Xu, Wenqiang Lei, Tat-Seng Chua, and Wai Lam. 2022. “A Unified Multi-Task Learning Framework for Multi-Goal Conversational Recommender Systems.” arXiv. https://doi.org/10.48550/ARXIV.2204.06923.
Denny, Paul, Viraj Kumar, and Nasser Giacaman. 2022. “Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language.” arXiv. https://doi.org/10.48550/ARXIV.2210.15157.
Deshpande, Ameet, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. “Toxicity in ChatGPT: Analyzing Persona-Assigned Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.05335.
Dhingra, Bhuwan, Hanxiao Liu, Ruslan Salakhutdinov, and William W. Cohen. 2017. “A Comparative Study of Word Embeddings for Reading Comprehension.” arXiv. https://doi.org/10.48550/ARXIV.1703.00993.
Diao, Shizhe, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. “Active Prompting with Chain-of-Thought for Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.12246.
Doğruöz, A. Seza, Sunayana Sitaram, Barbara E. Bullock, and Almeida Jacqueline Toribio. 2023. “A Survey of Code-Switching: Linguistic and Social Perspectives for Language Technologies.” arXiv. https://doi.org/10.48550/ARXIV.2301.01967.
Dohare, Shibhansh, Harish Karnick, and Vivek Gupta. 2017. “Text Summarization Using Abstract Meaning Representation.” arXiv. https://doi.org/10.48550/ARXIV.1706.01678.
Dong, Zhendong, and Qiang Dong. 2006. “Hownet and the Computation of Meaning,” February. https://doi.org/10.1142/5935.
Douglass, Rex W., Thomas Leo Scherer, J. Andrés Gannon, Erik Gartzke, Jon Lindsay, Shannon Carcelli, Jonathan Wilkenfeld, et al. 2022. “Introducing the ICBe Dataset: Very High Recall and Precision Event Extraction from Narratives about International Crises.” arXiv. https://doi.org/10.48550/ARXIV.2202.07081.
Dunn, Matthew, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. “SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine.” arXiv. https://doi.org/10.48550/ARXIV.1704.05179.
Ehsan, Upol, Brent Harrison, Larry Chan, and Mark O. Riedl. 2017. “Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations.” arXiv. https://doi.org/10.48550/ARXIV.1702.07826.
Eisner, Jason. 2003. “Learning Non-Isomorphic Tree Mappings for Machine Translation.” Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL ’03. https://doi.org/10.3115/1075178.1075217.
Eloundou, Tyna, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. “GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2303.10130.
Eric, Mihail, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. “MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines.” arXiv. https://doi.org/10.48550/ARXIV.1907.01669.
Ettinger, Allyson. 2019. “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models.” arXiv. https://doi.org/10.48550/ARXIV.1907.13528.
Evans, David A., and Chengxiang Zhai. 1996. “Noun-Phrase Analysis in Unrestricted Text for Information Retrieval.” arXiv. https://doi.org/10.48550/ARXIV.CMP-LG/9605019.
Ezeani, Ignatius, Mahmoud El-Haj, Jonathan Morris, and Dawn Knight. 2022. “Introducing the Welsh Text Summarisation Dataset and Baseline Systems.” arXiv. https://doi.org/10.48550/ARXIV.2205.02545.
Fathullah, Yassir, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, et al. 2023. “Prompting Large Language Models with Speech Recognition Abilities.” arXiv. https://doi.org/10.48550/ARXIV.2307.11795.
Fedus, William, Ian Goodfellow, and Andrew M. Dai. 2018. “MaskGAN: Better Text Generation via Filling in The______.” arXiv. https://doi.org/10.48550/ARXIV.1801.07736.
Felten, Ed, Manav Raj, and Robert Seamans. 2023. “How Will Language Modelers Like ChatGPT Affect Occupations and Industries?” arXiv. https://doi.org/10.48550/ARXIV.2303.01157.
Flanagin, Annette, Kirsten Bibbins-Domingo, Michael Berkwits, and Stacy L. Christiansen. 2023. “Nonhuman ‘Authors’ and Implications for the Integrity of Scientific Publication and Medical Knowledge.” JAMA 329 (February). https://doi.org/10.1001/jama.2023.1344.
Frieder, Simon, Luca Pinchetti, Alexis Chevalier, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, and Julius Berner. 2023. “Mathematical Capabilities of ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2301.13867.
Fu, Jinlan, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. “GPTScore: Evaluate as You Desire.” arXiv. https://doi.org/10.48550/ARXIV.2302.04166.
Gan, Chengguang, and Tatsunori Mori. 2023. “Sensitivity and Robustness of Large Language Models to Prompt Template in Japanese Text Classification Tasks.” arXiv. https://doi.org/10.48550/ARXIV.2305.08714.
Ganguli, Deep, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, et al. 2023. “The Capacity for Moral Self-Correction in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.07459.
Gao, Jianfeng, Mu Li, Chang-Ning Huang, and Andi Wu. 2005. “Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach.” Computational Linguistics 31 (December). https://doi.org/10.1162/089120105775299177.
Gao, Jun, Huan Zhao, Changlong Yu, and Ruifeng Xu. 2023. “Exploring the Feasibility of ChatGPT for Event Extraction.” arXiv. https://doi.org/10.48550/ARXIV.2303.03836.
Gao, Mingqi, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023. “Human-Like Summarization Evaluation with ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2304.02554.
Ge, Yingqiang, Wenyue Hua, Kai Mei, Jianchao Ji, Juntao Tan, Shuyuan Xu, Zelong Li, and Yongfeng Zhang. 2023. “OpenAGI: When LLM Meets Domain Experts.” arXiv. https://doi.org/10.48550/ARXIV.2304.04370.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (September). https://doi.org/10.1257/jel.20181020.
Giulianelli, Mario, Jacqueline Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. “Under the Hood: Using Diagnostic Classifiers to Investigate and Improve How Language Models Track Agreement Information.” arXiv. https://doi.org/10.48550/ARXIV.1808.08079.
Gjurković, Matej, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, and Jan Šnajder. 2020. “PANDORA Talks: Personality and Demographics on Reddit.” arXiv. https://doi.org/10.48550/ARXIV.2004.04460.
Glockner, Max, Vered Shwartz, and Yoav Goldberg. 2018. “Breaking NLI Systems with Sentences That Require Simple Lexical Inferences.” arXiv. https://doi.org/10.48550/ARXIV.1805.02266.
Goldstein, Josh A., Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. 2023. “Generative Language Models and Automated Influence Operations: Emerging Threats and Potential Mitigations.” arXiv. https://doi.org/10.48550/ARXIV.2301.04246.
Goodfellow, Ian J., Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. “An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks.” arXiv. https://doi.org/10.48550/ARXIV.1312.6211.
Gordijn, Bert, and Henk ten Have. 2023. “ChatGPT: Evolution or Revolution?” Medicine, Health Care and Philosophy 26 (January). https://doi.org/10.1007/s11019-023-10136-0.
Gordon, Jonathan, and Lenhart K. Schubert. 2013. “WordNet Hierarchy Axiomatization and the Mass-Count Distinction.” 2013 IEEE Seventh International Conference on Semantic Computing, September. https://doi.org/10.1109/icsc.2013.31.
Grave, Edouard, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. “Learning Word Vectors for 157 Languages.” arXiv. https://doi.org/10.48550/ARXIV.1802.06893.
Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. “Multi-Step Regression Learning for Compositional Distributional Semantics.” arXiv. https://doi.org/10.48550/ARXIV.1301.6939.
Grimaldi, Gianluca, and Bruno Ehrler. 2023. “AI et al.: Machines Are about to Change Scientific Publishing Forever.” ACS Energy Letters 8 (January). https://doi.org/10.1021/acsenergylett.2c02828.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21. https://doi.org/10.1093/pan/mps028.
Gudibande, Arnav, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. “The False Promise of Imitating Proprietary LLMs.” arXiv. https://doi.org/10.48550/ARXIV.2305.15717.
Guo, Biyang, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. “How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection.” arXiv. https://doi.org/10.48550/ARXIV.2301.07597.
Guo, Weidong, Jiuding Yang, Kaitong Yang, Xiangyang Li, Zhuwei Rao, Yu Xu, and Di Niu. 2023. “Instruction Fusion: Advancing Prompt Evolution Through Hybridization.” arXiv. https://doi.org/10.48550/ARXIV.2312.15692.
Guo, Weiwei, Xiaowei Liu, Sida Wang, Michaeel Kazi, Zhoutong Fu, Huiji Gao, Jun Jia, Liang Zhang, and Bo Long. 2021. “Deep Natural Language Processing for LinkedIn Search Systems.” arXiv. https://doi.org/10.48550/ARXIV.2108.08252.
Gururangan, Suchin, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. “Annotation Artifacts in Natural Language Inference Data.” arXiv. https://doi.org/10.48550/ARXIV.1803.02324.
Hacker, Philipp, Andreas Engel, and Marco Mauer. 2023. “Regulating ChatGPT and Other Large Generative AI Models.” 2023 ACM Conference on Fairness, Accountability, and Transparency, June. https://doi.org/10.1145/3593013.3594067.
Haensch, Anna-Carolina, Sarah Ball, Markus Herklotz, and Frauke Kreuter. 2023. “Seeing ChatGPT Through Students’ Eyes: An Analysis of TikTok Data.” arXiv. https://doi.org/10.48550/ARXIV.2303.05349.
Hagendorff, Thilo. 2023. “Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods.” arXiv. https://doi.org/10.48550/ARXIV.2303.13988.
Halterman, Andrew. 2019. “Geolocating Political Events in Text.” Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science. https://doi.org/10.18653/v1/w19-2104.
Halterman, Andrew, and Benjamin J. Radford. 2021. “Few-Shot Upsampling for Protest Size Detection.” arXiv. https://doi.org/10.48550/ARXIV.2105.11260.
Halterman, Andrew, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi, and Grace I. Scarborough. 2023. “Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks.” arXiv. https://doi.org/10.48550/ARXIV.2304.01331.
Han, Tianyu, Lisa C. Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K. Bressem. 2023. “MedAlpaca – an Open-Source Collection of Medical Conversational AI Models and Training Data.” arXiv. https://doi.org/10.48550/ARXIV.2304.08247.
He, Xingwei, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, and Weizhu Chen. 2023. “AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators.” arXiv. https://doi.org/10.48550/ARXIV.2303.16854.
Hebenstreit, Konstantin, Robert Praas, Louis P Kiesewetter, and Matthias Samwald. 2023. “An Automatically Discovered Chain-of-Thought Prompt Generalizes to Novel Models and Datasets.” arXiv. https://doi.org/10.48550/ARXIV.2305.02897.
Henderson, Peter, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang. 2023. “Foundation Models and Fair Use.” arXiv. https://doi.org/10.48550/ARXIV.2303.15715.
Heston, Thomas F. 2023. “Prompt Engineering for Students of Medicine and Their Teachers.” arXiv. https://doi.org/10.48550/ARXIV.2308.11628.
Hiemstra, Djoerd. 1998. “A Linguistically Motivated Probabilistic Model of Information Retrieval.” Research and Advanced Technology for Digital Libraries. https://doi.org/10.1007/3-540-49653-x_34.
Ho, Xanh, Anh Khoa Duong Nguyen, An Tuan Dao, Junfeng Jiang, Yuki Chida, Kaito Sugimoto, Huy Quoc To, Florian Boudin, and Akiko Aizawa. 2024. “A Survey of Pre-Trained Language Models for Processing Scientific Text.” arXiv. https://doi.org/10.48550/ARXIV.2401.17824.
Hong, Junyuan, Jiachen T. Wang, Chenhui Zhang, Zhangheng Li, Bo Li, and Zhangyang Wang. 2023. “DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer.” arXiv. https://doi.org/10.48550/ARXIV.2312.03724.
Hopkins, Daniel J., and Gary King. 2009. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54 (December). https://doi.org/10.1111/j.1540-5907.2009.00428.x.
Hsu, I-Hung, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2021. “DEGREE: A Data-Efficient Generation-Based Event Extraction Model.” arXiv. https://doi.org/10.48550/ARXIV.2108.12724.
Hu, Xinyu, Pengfei Tang, Simiao Zuo, Zihan Wang, Bowen Song, Qiang Lou, Jian Jiao, and Denis Charles. 2023. “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing.” arXiv. https://doi.org/10.48550/ARXIV.2310.13855.
Hu, Xueyu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024. “Leveraging Print Debugging to Improve Code Generation in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.05319.
Hu, Zhiting, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. “Toward Controlled Generation of Text.” arXiv. https://doi.org/10.48550/ARXIV.1703.00955.
Huang, Lianzhe, Shuming Ma, Dongdong Zhang, Furu Wei, and Houfeng Wang. 2022. “Zero-Shot Cross-Lingual Transfer of Prompt-Based Tuning with a Unified Multilingual Prompt.” arXiv. https://doi.org/10.48550/ARXIV.2202.11451.
Huang, Rongjie, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, et al. 2023. “AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head.” arXiv. https://doi.org/10.48550/ARXIV.2304.12995.
Huang, Saffron, and Divya Siddarth. 2023. “Generative AI and the Digital Commons.” arXiv. https://doi.org/10.48550/ARXIV.2303.11074.
Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, et al. 2023. “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models.” arXiv. https://doi.org/10.48550/ARXIV.2305.08322.
Hulsebos, Madelon, Çagatay Demiralp, and Paul Groth. 2023. “GitTables: A Large-Scale Corpus of Relational Tables.” Proceedings of the ACM on Management of Data 1 (May). https://doi.org/10.1145/3588710.
Imani, Shima, Liang Du, and Harsh Shrivastava. 2023. “MathPrompter: Mathematical Reasoning Using Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2303.05398.
Jalil, Sajed, Suzzana Rafi, Thomas D. LaToza, Kevin Moran, and Wing Lam. 2023. “ChatGPT and Software Testing Education: Promises & Perils.” 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), April. https://doi.org/10.1109/icstw58534.2023.00078.
Jang, Myeongjun Erik, and Thomas Lukasiewicz. 2023. “Consistency Analysis of ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2303.06273.
Jankowski, Michael, and Robert A. Huber. 2023. “When Correlation Is Not Enough: Validating Populism Scores from Supervised Machine-Learning Models.” Political Analysis 31 (January). https://doi.org/10.1017/pan.2022.32.
Jannach, Dietmar, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. “A Survey on Conversational Recommender Systems.” ACM Computing Surveys 54 (May). https://doi.org/10.1145/3453154.
Jeronymo, Vitor, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, and Rodrigo Nogueira. 2023. “InPars-V2: Large Language Models as Efficient Dataset Generators for Information Retrieval.” arXiv. https://doi.org/10.48550/ARXIV.2301.01820.
Ji, Yunjie, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, and Xiangang Li. 2023. “Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases.” arXiv. https://doi.org/10.48550/ARXIV.2303.14742.
Jia, Ye, Heiga Zen, Jonathan Shen, Yu Zhang, and Yonghui Wu. 2021. “PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS.” arXiv. https://doi.org/10.48550/ARXIV.2103.15060.
Jiang, Gangwei, Caigao Jiang, Siqiao Xue, James Y. Zhang, Jun Zhou, Defu Lian, and Ying Wei. 2023. “Towards Anytime Fine-Tuning: Continually Pre-Trained Language Models with Hypernetwork Prompt.” arXiv. https://doi.org/10.48550/ARXIV.2310.13024.
Jiang, Nan, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. “Impact of Code Language Models on Automated Program Repair.” arXiv. https://doi.org/10.48550/ARXIV.2302.05020.
Jiao, Wenxiang, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. “Is ChatGPT a Good Translator? Yes with GPT-4 as the Engine.” arXiv. https://doi.org/10.48550/ARXIV.2301.08745.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” arXiv. https://doi.org/10.48550/ARXIV.1607.01759.
Junprung, Edward. 2023. “Exploring the Intersection of Large Language Models and Agent-Based Modeling via Prompt Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2308.07411.
Kalai, Adam Tauman, and Santosh S. Vempala. 2023. “Calibrated Language Models Must Hallucinate.” arXiv. https://doi.org/10.48550/ARXIV.2311.14648.
Kang, Wang-Cheng, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. “Do LLMs Understand User Preferences? Evaluating LLMs on User Rating Prediction.” arXiv. https://doi.org/10.48550/ARXIV.2305.06474.
Kazemitabaar, Majeed, Justin Chow, Carl Ka To Ma, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. “Studying the Effect of AI Code Generators on Supporting Novice Learners in Introductory Programming.” Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, April. https://doi.org/10.1145/3544548.3580919.
Kharitonov, Eugene, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. 2023. “Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2302.03540.
Khatun, Aisha, and Daniel G. Brown. 2023. “Reliability Check: An Analysis of GPT-3’s Response to Sensitive Topics and Prompt Wording.” arXiv. https://doi.org/10.48550/ARXIV.2306.06199.
Kiciman, Emre, Scott Counts, and Melissa Gasser. 2018. “Using Longitudinal Social Media Analysis to Understand the Effects of Early College Alcohol Use.” Proceedings of the International AAAI Conference on Web and Social Media 12 (June). https://doi.org/10.1609/icwsm.v12i1.15012.
Kilgarriff, Adam. 1997. “‘I Don’t Believe in Word Senses.’” arXiv. https://doi.org/10.48550/ARXIV.CMP-LG/9712006.
Kim, Geunwoo, Pierre Baldi, and Stephen McAleer. 2023. “Language Models Can Solve Computer Tasks.” arXiv. https://doi.org/10.48550/ARXIV.2303.17491.
Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. “Temporal Analysis of Language Through Neural Language Models.” arXiv. https://doi.org/10.48550/ARXIV.1405.3515.
Kim, Yubin, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park. 2024. “Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data.” arXiv. https://doi.org/10.48550/ARXIV.2401.06866.
Kirby, Simon. 2002. “Natural Language from Artificial Life.” Artificial Life 8 (April). https://doi.org/10.1162/106454602320184248.
Kirby, Simon, Mike Dowman, and Thomas L. Griffiths. 2007. “Innateness and Culture in the Evolution of Language.” Proceedings of the National Academy of Sciences 104 (March). https://doi.org/10.1073/pnas.0608222104.
Kirk, Hannah Rose, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2023. “Personalisation Within Bounds: A Risk Taxonomy and Policy Framework for the Alignment of Large Language Models with Personalised Feedback.” arXiv. https://doi.org/10.48550/ARXIV.2303.05453.
Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, et al. 2018. “UniMorph 2.0: Universal Morphology.” arXiv. https://doi.org/10.48550/ARXIV.1810.11101.
Kitaev, Nikita, and Dan Klein. 2018. “Constituency Parsing with a Self-Attentive Encoder.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/p18-1249.
Kocmi, Tom, and Christian Federmann. 2023. “Large Language Models Are State-of-the-Art Evaluators of Translation Quality.” arXiv. https://doi.org/10.48550/ARXIV.2302.14520.
Köpf, Andreas, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, et al. 2023. “OpenAssistant Conversations – Democratizing Large Language Model Alignment.” arXiv. https://doi.org/10.48550/ARXIV.2304.07327.
Kortemeyer, Gerd. 2023. “Could an Artificial-Intelligence Agent Pass an Introductory Physics Course?” Physical Review Physics Education Research 19 (May). https://doi.org/10.1103/physrevphyseducres.19.010132.
Kosinski, Michal. 2023. “Theory of Mind Might Have Spontaneously Emerged in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.02083.
Kreiner, Hamutal, and Zohar Eviatar. 2014. “The Missing Link in the Embodiment of Syntax: Prosody.” Brain and Language 137 (October). https://doi.org/10.1016/j.bandl.2014.08.004.
Krishna, Kalpesh, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. “Paraphrasing Evades Detectors of AI-Generated Text, but Retrieval Is an Effective Defense.” arXiv. https://doi.org/10.48550/ARXIV.2303.13408.
Kumar, Krishna. 2023. “Geotechnical Parrot Tales (GPT): Harnessing Large Language Models in Geotechnical Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2304.02138.
Lai, Viet Dac, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. 2023. “ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning.” arXiv. https://doi.org/10.48550/ARXIV.2304.05613.
Lei, Ioktong, and Zhidong Deng. 2023. “SelfzCoT: A Self-Prompt Zero-Shot CoT from Semantic-Level to Code-Level for a Better Utilization of LLMs.” arXiv. https://doi.org/10.48550/ARXIV.2305.11461.
Leidinger, Alina, Robert van Rooij, and Ekaterina Shutova. 2023. “The Language of Prompting: What Linguistic Properties Make a Prompt Successful?” arXiv. https://doi.org/10.48550/ARXIV.2311.01967.
Leinonen, Juho, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. “Comparing Code Explanations Created by Students and Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.03938.
Leser, U., and J. Hakenberg. 2005. “What Makes a Gene Name? Named Entity Recognition in the Biomedical Literature.” Briefings in Bioinformatics 6 (January). https://doi.org/10.1093/bib/6.4.357.
Levine, Yoav, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. “SenseBERT: Driving Some Sense into BERT.” arXiv. https://doi.org/10.48550/ARXIV.1908.05646.
Li, Bowen, Kwang Hee Yoo, Zhi Wang, André L. Boehman, and Jianxin Wang. 2019. “Experimental and Numerical Study on Autoignition Characteristics of the Gasoline/Diesel/Ethanol and Gasoline/Diesel/PODE/Ethanol Fuels.” Energy &Amp; Fuels 33 (October). https://doi.org/10.1021/acs.energyfuels.9b02013.
Li, Changmao, and Jeffrey Flanigan. 2023. “Task Contamination: Language Models May Not Be Few-Shot Anymore.” arXiv. https://doi.org/10.48550/ARXIV.2312.16337.
Li, Guohao, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. “CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society.” arXiv. https://doi.org/10.48550/ARXIV.2303.17760.
Li, Haoran, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023. “Multi-Step Jailbreaking Privacy Attacks on ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2304.05197.
Li, Jiangmeng, Fei Song, Yifan Jin, Wenwen Qiang, Changwen Zheng, Fuchun Sun, and Hui Xiong. 2024. “BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-Shot Inference via Debiased Domain Abstraction.” arXiv. https://doi.org/10.48550/ARXIV.2401.14166.
Li, Minghao, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.” arXiv. https://doi.org/10.48550/ARXIV.2304.08244.
Li, Piji, Lidong Bing, and Wai Lam. 2018. “Actor-Critic Based Training Framework for Abstractive Summarization.” arXiv. https://doi.org/10.48550/ARXIV.1803.11070.
Li, Shaobo, Xiaoguang Li, Lifeng Shang, Zhenhua Dong, Chengjie Sun, Bingquan Liu, Zhenzhou Ji, Xin Jiang, and Qun Liu. 2022. “How Pre-Trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis.” arXiv. https://doi.org/10.48550/ARXIV.2203.16747.
Li, Shuai, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. 2023. “The Closeness of in-Context Learning and Weight Shifting for Softmax Regression.” arXiv. https://doi.org/10.48550/ARXIV.2304.13276.
Li, Yuanchun, and Oriana Riva. 2021. “Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites.” Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462905.
Li, Yujian Betterest, and Kai Wu. 2023. “SPELL: Semantic Prompt Evolution Based on a LLM.” arXiv. https://doi.org/10.48550/ARXIV.2310.01260.
Liang, Jacky, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2022. “Code as Policies: Language Model Programs for Embodied Control.” arXiv. https://doi.org/10.48550/ARXIV.2209.07753.
Liang, Weixin, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. 2023. “GPT Detectors Are Biased Against Non-Native English Writers.” arXiv. https://doi.org/10.48550/ARXIV.2304.02819.
Liévin, Valentin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. 2022. “Can Large Language Models Reason about Medical Questions?” arXiv. https://doi.org/10.48550/ARXIV.2207.08143.
Lim, Sue, and Ralf Schmälzle. 2022. “Artificial Intelligence for Health Message Generation: Theory, Method, and an Empirical Study Using Prompt Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2212.07507.
Lin, Hao, Pradeep Nalluri, Lantian Li, Yifan Sun, and Yongjun Zhang. 2022. “Multiplex Anti-Asian Sentiment Before and During the Pandemic: Introducing New Datasets from Twitter Mining.” Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis. https://doi.org/10.18653/v1/2022.wassa-1.2.
Lin, Stephanie, Jacob Hilton, and Owain Evans. 2021. “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” arXiv. https://doi.org/10.48550/ARXIV.2109.07958.
Lin, Wan-Hsuan, and Chun-Shien Lu. 2020. “Automated Graph Generation at Sentence Level for Reading Comprehension Based on Conceptual Graphs.” Proceedings of the 28th International Conference on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.240.
Lin, Yongjie, Yi Chern Tan, and Robert Frank. 2019. “Open Sesame: Getting Inside BERT’s Linguistic Knowledge.” arXiv. https://doi.org/10.48550/ARXIV.1906.01698.
Liu, Alisa, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, and Noah A. Smith. 2024. “Tuning Language Models by Proxy.” arXiv. https://doi.org/10.48550/ARXIV.2401.08565.
Liu, Hao, Carmelo Sferrazza, and Pieter Abbeel. 2023. “Chain of Hindsight Aligns Language Models with Feedback.” arXiv. https://doi.org/10.48550/ARXIV.2302.02676.
Liu, Haozhe, Wentian Zhang, Bing Li, Haoqian Wu, Nanjun He, Yawen Huang, Yuexiang Li, Bernard Ghanem, and Yefeng Zheng. 2023. “Improving GAN Training via Feature Space Shrinkage.” arXiv. https://doi.org/10.48550/ARXIV.2303.01559.
Liu, Jiacheng, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. 2024. “Infini-Gram: Scaling Unbounded n-Gram Language Models to a Trillion Tokens.” arXiv. https://doi.org/10.48550/ARXIV.2401.17377.
Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. “Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” arXiv. https://doi.org/10.48550/ARXIV.2107.13586.
Liu, Siyu, Tongqi Wen, A. S. L. Subrahmanyam Pattamatta, and David J. Srolovitz. 2024. “A Prompt-Engineered Large Language Model, Deep Learning Workflow for Materials Classification.” arXiv. https://doi.org/10.48550/ARXIV.2401.17788.
Liu, Weijie, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2019. “K-BERT: Enabling Language Representation with Knowledge Graph.” arXiv. https://doi.org/10.48550/ARXIV.1909.07606.
Liu, Yang, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. “G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment.” arXiv. https://doi.org/10.48550/ARXIV.2303.16634.
Liu, Yang, Fanyou Wu, Zhiyuan Liu, Kai Wang, Feiyue Wang, and Xiaobo Qu. 2023. “Can Language Models Be Used for Real-World Urban-Delivery Route Optimization?” The Innovation 4 (November). https://doi.org/10.1016/j.xinn.2023.100520.
“Logic and Lexicon.” 1995. Studies in Linguistics and Philosophy. https://doi.org/10.1007/978-94-015-8445-6.
Lu, Ximing, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. “Quark: Controllable Text Generation with Reinforced Unlearning.” arXiv. https://doi.org/10.48550/ARXIV.2205.13636.
Lu, Yao, Jiayi Wang, Sebastian Riedel, and Pontus Stenetorp. 2023. “Prompt Optimisation with Random Sampling.” arXiv. https://doi.org/10.48550/ARXIV.2311.09569.
Luo, Guoqing, Yu Tong Han, Lili Mou, and Mauajama Firdaus. 2023. “Prompt-Based Editing for Text Style Transfer.” arXiv. https://doi.org/10.48550/ARXIV.2301.11997.
Luo, Hengyu, Peng Liu, and Stefan Esping. 2023. “Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification.” arXiv. https://doi.org/10.48550/ARXIV.2309.14779.
Luo, Zheheng, Qianqian Xie, and Sophia Ananiadou. 2023. “ChatGPT as a Factual Inconsistency Evaluator for Text Summarization.” arXiv. https://doi.org/10.48550/ARXIV.2303.15621.
Lyu, Yuanjie, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. 2024. “CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.17043.
Ma, Chenkai. 2023. “Prompt Engineering and Calibration for Zero-Shot Commonsense Reasoning.” arXiv. https://doi.org/10.48550/ARXIV.2304.06962.
Ma, Ruotian, Xin Zhou, Tao Gui, Yiding Tan, Linyang Li, Qi Zhang, and Xuanjing Huang. 2021. “Template-Free Prompt Tuning for Few-Shot NER.” arXiv. https://doi.org/10.48550/ARXIV.2109.13532.
Ma, Wanqin, Chenyang Yang, and Christian Kästner. 2023. “(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs.” arXiv. https://doi.org/10.48550/ARXIV.2311.11123.
Ma, Yubo, Yixin Cao, YongChing Hong, and Aixin Sun. 2023. “Large Language Model Is Not a Good Few-Shot Information Extractor, but a Good Reranker for Hard Samples!” arXiv. https://doi.org/10.48550/ARXIV.2303.08559.
“Machine Learning: ECML 2004.” 2004. Lecture Notes in Computer Science. https://doi.org/10.1007/b100702.
MacNeil, Stephen, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2022. “Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development e-Book.” arXiv. https://doi.org/10.48550/ARXIV.2211.02265.
MacNeil, Stephen, Andrew Tran, Joanne Kim, Ziheng Huang, Seth Bernstein, and Dan Mogil. 2023. “Prompt Middleware: Mapping Prompts for Large Language Models to UI Affordances.” arXiv. https://doi.org/10.48550/ARXIV.2307.01142.
Mahabadi, Rabeeh Karimi, Luke Zettlemoyer, James Henderson, Marzieh Saeidi, Lambert Mathias, Veselin Stoyanov, and Majid Yazdani. 2022. “PERFECT: Prompt-Free and Efficient Few-Shot Learning with Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2204.01172.
Mahowald, Kyle, Anna A. Ivanova, Idan A. Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. “Dissociating Language and Thought in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2301.06627.
Manakul, Potsawee, Adian Liusie, and Mark J. F. Gales. 2023. “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2303.08896.
Mann, Virginia A. 1980. “Influence of Preceding Liquid on Stop-Consonant Perception.” Perception & Psychophysics 28 (September). https://doi.org/10.3758/bf03204884.
Manning, Christopher D. 2015. “Computational Linguistics and Deep Learning.” Computational Linguistics 41 (December). https://doi.org/10.1162/coli_a_00239.
Mao, Junyu, Stuart E. Middleton, and Mahesan Niranjan. 2023. “Do Prompt Positions Really Matter?” arXiv. https://doi.org/10.48550/ARXIV.2305.14493.
Mao, Rui, Kai He, Xulang Zhang, Guanyi Chen, Jinjie Ni, Zonglin Yang, and Erik Cambria. 2023. “A Survey on Semantic Processing Techniques.” arXiv. https://doi.org/10.48550/ARXIV.2310.18345.
Loo, Mark P. J. van der. 2014. “The stringdist Package for Approximate String Matching.” The R Journal 6. https://doi.org/10.32614/rj-2014-011.
Melamed, Rimon, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, and Enric Boix-Adsera. 2023. “PROPANE: Prompt Design as an Inverse Problem.” arXiv. https://doi.org/10.48550/ARXIV.2311.07064.
Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, and Marta Miori. 2022. “Does GPT-3 Know What the Most Important Issue Is? Using Large Language Models to Code Open-Text Social Survey Responses at Scale.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4310154.
Menezes, Daniel Specht, Pedro Savarese, and Ruy Luiz Milidiú. 2019. “Building a Massive Corpus for Named Entity Recognition Using Free Open Data Sources.” arXiv. https://doi.org/10.48550/ARXIV.1908.05758.
Mialon, Grégoire, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, et al. 2023. “Augmented Language Models: A Survey.” arXiv. https://doi.org/10.48550/ARXIV.2302.07842.
Michael, Julian, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, et al. 2022. “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey.” arXiv. https://doi.org/10.48550/ARXIV.2208.12852.
Michaelov, James A., Megan D. Bardolph, Cyma K. Van Petten, Benjamin K. Bergen, and Seana Coulson. 2024. “Strong Prediction: Language Model Surprisal Explains Multiple N400 Effects.” Neurobiology of Language, January. https://doi.org/10.1162/nol_a_00105.
Miller, Tim. 2021. “Contrastive Explanation: A Structural-Model Approach.” The Knowledge Engineering Review 36. https://doi.org/10.1017/s0269888921000102.
Minato, Takashi, Ryuichiro Higashinaka, Kurima Sakai, Tomo Funayama, Hiromitsu Nishizaki, and Takayuki Nagai. 2024. “Overview of Dialogue Robot Competition 2023.” arXiv. https://doi.org/10.48550/ARXIV.2401.03547.
Mirza, Shujaat, Bruno Coelho, Yuyuan Cui, Christina Pöpper, and Damon McCoy. 2024. “Global-Liar: Factuality of LLMs over Time and Geographic Regions.” arXiv. https://doi.org/10.48550/ARXIV.2401.17839.
Mishra, Abhijit, Diptesh Kanojia, Seema Nagar, Kuntal Dey, and Pushpak Bhattacharyya. 2017. “Harnessing Cognitive Features for Sarcasm Detection.” arXiv. https://doi.org/10.48550/ARXIV.1701.05574.
Mishra, Aditi, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, and Chris Bryan. 2023. “PromptAid: Prompt Exploration, Perturbation, Testing and Iteration Using Visual Analytics for Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.01964.
Mollick, Ethan R., and Lilach Mollick. 2023. “Using AI to Implement Effective Teaching Strategies in Classrooms: Five Strategies, Including Prompts.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4391243.
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16. https://doi.org/10.1093/pan/mpn018.
Mostafazadeh, Nasrin, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. “A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories.” arXiv. https://doi.org/10.48550/ARXIV.1604.01696.
Muhammad, Shamsuddeen Hassan, Idris Abdulmumin, Seid Muhie Yimam, David Ifeoluwa Adelani, Ibrahim Sa’id Ahmad, Nedjma Ousidhoum, Abinew Ayele, Saif M. Mohammad, Meriem Beloucif, and Sebastian Ruder. 2023. “SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval).” arXiv. https://doi.org/10.48550/ARXIV.2304.06845.
Muktadir, Golam Md. 2023. “A Brief History of Prompt: Leveraging Language Models. (Through Advanced Prompting).” arXiv. https://doi.org/10.48550/ARXIV.2310.04438.
Nair, Varun, Elliot Schumacher, Geoffrey Tso, and Anitha Kannan. 2023. “DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents.” arXiv. https://doi.org/10.48550/ARXIV.2303.17071.
Nan, Linyong, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and Dragomir Radev. 2023. “Enhancing Few-Shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.” arXiv. https://doi.org/10.48550/ARXIV.2305.12586.
“Natural Language Processing – IJCNLP 2005.” 2005. Lecture Notes in Computer Science. https://doi.org/10.1007/11562214.
Nigam, Kamal, and Matthew Hurst. n.d. “Towards a Robust Metric of Polarity.” The Information Retrieval Series. https://doi.org/10.1007/1-4020-4102-0_20.
Noord, Rik van, and Johan Bos. 2017. “Neural Semantic Parsing by Character-Based Translation: Experiments with Abstract Meaning Representations,” May. http://arxiv.org/abs/1705.09980v2.
Ortega-Martín, Miguel, Óscar García-Sierra, Alfonso Ardoiz, Jorge Álvarez, Juan Carlos Armenteros, and Adrián Alonso. 2023. “Linguistic Ambiguity Analysis in ChatGPT.” arXiv. https://doi.org/10.48550/ARXIV.2302.06426.
Panchenko, Alexander, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2017. “Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl.” arXiv. https://doi.org/10.48550/ARXIV.1710.01779.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - EMNLP ’02. https://doi.org/10.3115/1118693.1118704.
Parameswaran, Aditya G., Shreya Shankar, Parth Asawa, Naman Jain, and Yujie Wang. 2023. “Revisiting Prompt Engineering via Declarative Crowdsourcing.” arXiv. https://doi.org/10.48550/ARXIV.2308.03854.
Pardos, Zachary A., and Shreya Bhandari. 2023. “Learning Gain Differences Between ChatGPT and Human Tutor Generated Algebra Hints.” arXiv. https://doi.org/10.48550/ARXIV.2302.06871.
Pareschi, Remo. 2023. “Abductive Reasoning with the GPT-4 Language Model: Case Studies from Criminal Investigation, Medical Practice, Scientific Research.” arXiv. https://doi.org/10.48550/ARXIV.2307.10250.
Park, Jeong Yeon, Hyeong Jin Shin, and Jae Sung Lee. 2022. “Word Sense Disambiguation Using Clustered Sense Labels.” Applied Sciences 12 (February). https://doi.org/10.3390/app12041857.
Pasini, Tommaso, and Jose Camacho-Collados. 2018. “A Short Survey on Sense-Annotated Corpora.” arXiv. https://doi.org/10.48550/ARXIV.1802.04744.
Paun, Silviu, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio. 2018. “Comparing Bayesian Models of Annotation.” Transactions of the Association for Computational Linguistics 6 (December). https://doi.org/10.1162/tacl_a_00040.
Pei, Shichao, Lu Yu, and Xiangliang Zhang. 2019. “Improving Cross-Lingual Entity Alignment via Optimal Transport.” Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, August. https://doi.org/10.24963/ijcai.2019/448.
Peng, Baolin, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, et al. 2023. “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback.” arXiv. https://doi.org/10.48550/ARXIV.2302.12813.
Peng, Baolin, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. “Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning.” Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d17-1237.
Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. 2004. “Augmenting Naive Bayes Classifiers with Statistical Language Models.” Information Retrieval 7 (September). https://doi.org/10.1023/b:inrt.0000011209.19643.e2.
Perez, Fábio, and Ian Ribeiro. 2022. “Ignore Previous Prompt: Attack Techniques for Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2211.09527.
Pratap, Vineel, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. “MLS: A Large-Scale Multilingual Dataset for Speech Research.” Interspeech 2020, October. https://doi.org/10.21437/interspeech.2020-2826.
“Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).” 2014. https://doi.org/10.3115/v1/d14-1.
“Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.” 2020. https://doi.org/10.18653/v1/2020.acl-main.
“Proceedings of the Web Conference 2021.” 2021, April. https://doi.org/10.1145/3442381.
“Proceedings of Third International Conference on Sustainable Expert Systems.” 2023. Lecture Notes in Networks and Systems. https://doi.org/10.1007/978-981-19-7874-6.
Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” arXiv. https://doi.org/10.48550/ARXIV.2310.03693.
Qin, Chengwei, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. “Is ChatGPT a General-Purpose Natural Language Processing Task Solver?” arXiv. https://doi.org/10.48550/ARXIV.2302.06476.
Qin, Libo, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. “A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.1909.02188.
Ram, Ori, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. “In-Context Retrieval-Augmented Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.00083.
Ramesh, Krithika, Sunayana Sitaram, and Monojit Choudhury. 2023. “Fairness in Language Models Beyond English: Gaps and Challenges.” arXiv. https://doi.org/10.48550/ARXIV.2302.12578.
Reddy, Siva, Danqi Chen, and Christopher D. Manning. 2018. “CoQA: A Conversational Question Answering Challenge.” arXiv. https://doi.org/10.48550/ARXIV.1808.07042.
“Replication Data for: Not so Harmless After All: The Fixed-Effects Model.” 2017. https://doi.org/10.7910/DVN/RAUIHG.
Reynolds, Laria, and Kyle McDonell. 2021. “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.” arXiv. https://doi.org/10.48550/ARXIV.2102.07350.
Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” arXiv. https://doi.org/10.48550/ARXIV.2005.04118.
Ridnik, Tal, Dedy Kredo, and Itamar Friedman. 2024. “Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2401.08500.
Roberts, Angus, Robert Gaizauskas, Mark Hepple, George Demetriou, Yikun Guo, Ian Roberts, and Andrea Setzer. 2009. “Building a Semantically Annotated Corpus of Clinical Texts.” Journal of Biomedical Informatics 42 (October). https://doi.org/10.1016/j.jbi.2008.12.013.
Rospigliosi, Pericles ‘asher’. 2023. “Artificial Intelligence in Teaching and Learning: What Questions Should We Ask of ChatGPT?” Interactive Learning Environments 31 (January). https://doi.org/10.1080/10494820.2023.2180191.
Röttger, Paul, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2023. “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2308.01263.
Ruan, Jingqing, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, et al. 2023. “TPTU: Large Language Model-Based AI Agents for Task Planning and Tool Usage.” arXiv. https://doi.org/10.48550/ARXIV.2308.03427.
Sadat, Fatiha, and Nizar Habash. 2006. “Combination of Arabic Preprocessing Schemes for Statistical Machine Translation.” Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL - ACL ’06. https://doi.org/10.3115/1220175.1220176.
Sailunaz, Kashfia. 2018. “Emotion and Sentiment Analysis from Twitter Text,” July. https://doi.org/10.11575/PRISM/32714.
Sakai, Takao, Masatsugu Ohta, Yusuke Furukawa, Yumiko Saga, Shinichi Aizawa, Hisaaki Kawakatsu, and Masaki Saito. 1995. “Tenascin‐c Induction by the Diffusible Factor Epidermal Growth Factor in Stromal‐epithelial Interactions.” Journal of Cellular Physiology 165 (October). https://doi.org/10.1002/jcp.1041650104.
Samaan, Jamil S., Yee Hui Yeo, Nithya Rajeev, Lauren Hawley, Stuart Abel, Wee Han Ng, Nitin Srinivasan, et al. 2023. “Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery.” Obesity Surgery 33 (April). https://doi.org/10.1007/s11695-023-06603-5.
Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. “Whose Opinions Do Language Models Reflect?” arXiv. https://doi.org/10.48550/ARXIV.2303.17548.
Sap, Maarten, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2018. “ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning.” arXiv. https://doi.org/10.48550/ARXIV.1811.00146.
Saul, Lawrence, and Fernando Pereira. 1997. “Aggregate and Mixed-Order Markov Models for Statistical Language Processing.” arXiv. https://doi.org/10.48550/ARXIV.CMP-LG/9706007.
Semnani, Sina J., Violet Z. Yao, Heidi C. Zhang, and Monica S. Lam. 2023. “WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia.” arXiv. https://doi.org/10.48550/ARXIV.2305.14292.
Serra, Giane Moliari Amaral, Inesita Soares de Araujo, and Elizabeth Moreira dos Santos. 2012. “Comer Com Os Olhos: Discursos Televisivos e Produção de Sentidos Na Promoção Da Saúde Nutricional de Adolescentes” [Eating with the Eyes: Television Discourse and the Production of Meaning in Promoting Adolescents’ Nutritional Health]. RECIIS 6 (December). https://doi.org/10.3395/reciis.v6i4.682pt.
Shah, Chirag. 2024. “From Prompt Engineering to Prompt Science with Human in the Loop.” arXiv. https://doi.org/10.48550/ARXIV.2401.04122.
Shah, Pararth, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. “Building a Conversational Agent Overnight with Dialogue Self-Play.” arXiv. https://doi.org/10.48550/ARXIV.1801.04871.
Shakarian, Paulo, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. “An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP).” arXiv. https://doi.org/10.48550/ARXIV.2302.13814.
Shao, Zhihong, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. “Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.00618.
Shen, Junxiao, John J. Dudley, Jingyao Zheng, Bill Byrne, and Per Ola Kristensson. 2023. “Promptor: A Conversational and Autonomous Prompt Generation Agent for Intelligent Text Entry Techniques.” arXiv. https://doi.org/10.48550/ARXIV.2310.08101.
Shen, Lingfeng, Weiting Tan, Boyuan Zheng, and Daniel Khashabi. 2023. “Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency.” arXiv. https://doi.org/10.48550/ARXIV.2305.10713.
Shen, Si, Jiangfeng Liu, Litao Lin, Ying Huang, Lin Zhang, Chang Liu, Yutong Feng, and Dongbo Wang. 2022. “SsciBERT: A Pre-Trained Language Model for Social Science Texts.” arXiv. https://doi.org/10.48550/ARXIV.2206.04510.
Shi, Fobo, Peijun Qing, Dong Yang, Nan Wang, Youbo Lei, Haonan Lu, and Xiaodong Lin. 2023. “Prompt Space Optimizing Few-Shot Reasoning Success with Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2306.03799.
Shi, Wei, Siyuan Zhang, Zhiwei Zhang, Hong Cheng, and Jeffrey Xu Yu. 2020. “Joint Embedding in Named Entity Linking on Sentence Level.” arXiv. https://doi.org/10.48550/ARXIV.2002.04936.
Shinn, Noah, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv. https://doi.org/10.48550/ARXIV.2303.11366.
Shum, Heung-Yeung, Xiaodong He, and Di Li. 2018. “From Eliza to XiaoIce: Challenges and Opportunities with Social Chatbots.” arXiv. https://doi.org/10.48550/ARXIV.1801.01957.
Singhal, Karan, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, et al. 2023. “Large Language Models Encode Clinical Knowledge.” Nature 620 (July). https://doi.org/10.1038/s41586-023-06291-2.
Smith, Noah A. 2020. “Contextual Word Representations.” Communications of the ACM 63 (May). https://doi.org/10.1145/3347145.
Sorensen, Taylor, Joshua Robinson, Christopher Rytting, Alexander Shaw, Kyle Rogers, Alexia Delorey, Mahmoud Khalil, Nancy Fulda, and David Wingate. 2022. “An Information-Theoretic Approach to Prompt Engineering Without Ground Truth Labels.” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.60.
Spangher, Alexander, and Jonathan May. 2021. “NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing).” arXiv. https://doi.org/10.48550/ARXIV.2104.09647.
Strimel, Grant P., Kanthashree Mysore Sathyendra, and Stanislav Peshterliev. 2018. “Statistical Model Compression for Small-Footprint Natural Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.1807.07520.
Su, Guinan, Yanwu Yang, and Jie Guo. 2023. “Prompt Your Mind: Refine Personalized Text Prompts Within Your Mind.” arXiv. https://doi.org/10.48550/ARXIV.2311.05114.
Su, Zhenlin, Liyan Xu, Jin Xu, Jiangnan Li, and Mingdu Huangfu. 2023. “SIG: Speaker Identification in Literature via Prompt-Based Generation.” arXiv. https://doi.org/10.48550/ARXIV.2312.14590.
Şulea, Octavia-Maria, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2017. “Predicting the Law Area and Decisions of French Supreme Court Cases.” RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning, November. https://doi.org/10.26615/978-954-452-049-6_092.
Sun, Yuhan, Mukai Li, Yixin Cao, Kun Wang, Wenxiao Wang, Xingyu Zeng, and Rui Zhao. 2023. “To Be or Not to Be? An Exploration of Continuously Controllable Prompt Engineering.” arXiv. https://doi.org/10.48550/ARXIV.2311.09773.
Sun, Zhiqing, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. “Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2305.03047.
Suzgun, Mirac, and Adam Tauman Kalai. 2024. “Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding.” arXiv. https://doi.org/10.48550/ARXIV.2401.12954.
Tan, Hexiang, Fei Sun, Wanli Yang, Yuanzhuo Wang, Qi Cao, and Xueqi Cheng. 2024. “Blinded by Generated Contexts: How Language Models Merge Generated and Retrieved Contexts for Open-Domain QA?” arXiv. https://doi.org/10.48550/ARXIV.2401.11911.
Tang, Jerry, Amanda LeBel, Shailee Jain, and Alexander G. Huth. 2022. “Semantic Reconstruction of Continuous Language from Non-Invasive Brain Recordings,” September. https://doi.org/10.1101/2022.09.29.509744.
Taylor, Niall, Yi Zhang, Dan Joyce, Alejo Nevado-Holgado, and Andrey Kormilitzin. 2022. “Clinical Prompt Learning with Frozen Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2205.05535.
Teh, Yee Whye. 2006. “A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes.” Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL - ACL ’06. https://doi.org/10.3115/1220175.1220299.
Tian, Jacob-Junqi, David Emerson, Sevil Zanjani Miyandoab, Deval Pandya, Laleh Seyyed-Kalantari, and Faiza Khan Khattak. 2023. “Soft-Prompt Tuning for Large Language Models to Evaluate Bias.” arXiv. https://doi.org/10.48550/ARXIV.2306.04735.
“Tools Such as ChatGPT Threaten Transparent Science; Here Are Our Ground Rules for Their Use.” 2023. Nature 613 (January). https://doi.org/10.1038/d41586-023-00191-1.
Toraman, Cagri, Eyup Halit Yilmaz, Furkan Şahinuç, and Oguzhan Ozcelik. 2023. “Impact of Tokenization on Language Models: An Analysis for Turkish.” ACM Transactions on Asian and Low-Resource Language Information Processing 22 (March). https://doi.org/10.1145/3578707.
Toyer, Sam, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, et al. 2023. “Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game.” arXiv. https://doi.org/10.48550/ARXIV.2311.01011.
Trautmann, Dietrich, Alina Petrova, and Frank Schilder. 2022. “Legal Prompt Engineering for Multilingual Legal Judgement Prediction.” arXiv. https://doi.org/10.48550/ARXIV.2212.02199.
Tufano, Michele, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. “Unit Test Case Generation with Transformers and Focal Context.” arXiv. https://doi.org/10.48550/ARXIV.2009.05617.
Turney, Peter D. 2002. “Mining the Web for Synonyms: PMI-IR Versus LSA on TOEFL.” arXiv. https://doi.org/10.48550/ARXIV.CS/0212033.
Turney, Peter D., Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. 2003. “Combining Independent Modules to Solve Multiple-Choice Synonym and Analogy Problems.” arXiv. https://doi.org/10.48550/ARXIV.CS/0309035.
Tymoshenko, Kateryna, and Alessandro Moschitti. 2015. “Assessing the Impact of Syntactic and Semantic Structures for Answer Passages Reranking.” Proceedings of the 24th ACM International Conference on Information and Knowledge Management, October. https://doi.org/10.1145/2806416.2806490.
Ullman, Tomer. 2023. “Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks.” arXiv. https://doi.org/10.48550/ARXIV.2302.08399.
Vries, Erik de, Martijn Schoonvelde, and Gijs Schumacher. 2018. “No Longer Lost in Translation: Evidence That Google Translate Works for Comparative Bag-of-Words Text Applications.” Political Analysis 26 (September). https://doi.org/10.1017/pan.2018.26.
Wang, Haochun, Sendong Zhao, Chi Liu, Nuwa Xi, Muzhen Cai, Bing Qin, and Ting Liu. 2023. “Manifold-Based Verbalizer Space Re-Embedding for Tuning-Free Prompt-Based Classification.” arXiv. https://doi.org/10.48550/ARXIV.2309.04174.
Wang, Jindong, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, et al. 2023. “On the Robustness of ChatGPT: An Adversarial and Out-of-Distribution Perspective.” arXiv. https://doi.org/10.48550/ARXIV.2302.12095.
Wang, Longyue, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. “Document-Level Machine Translation with Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.02210.
Wang, Luyu, Yujia Li, Ozlem Aslan, and Oriol Vinyals. 2021. “WikiGraphs: A Wikipedia Text - Knowledge Graph Paired Dataset.” arXiv. https://doi.org/10.48550/ARXIV.2107.09556.
Wang, Qingyue, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. 2023. “Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2308.15022.
Wang, William Yang. 2017. “‘Liar, Liar Pants on Fire’: A New Benchmark Dataset for Fake News Detection.” arXiv. https://doi.org/10.48550/ARXIV.1705.00648.
Wang, Xintao, Zhouhong Gu, Jiaqing Liang, Dakuan Lu, Yanghua Xiao, and Wei Wang. 2024. “ConcEPT: Concept-Enhanced Pre-Training for Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.05669.
Wang, Xinyi, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. 2023. “Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for in-Context Learning.” arXiv. https://doi.org/10.48550/ARXIV.2301.11916.
Wang, Xun, Tao Ge, Allen Mao, Yuki Li, Furu Wei, and Si-Qing Chen. 2022. “Pay Attention to Your Tone: Introducing a New Dataset for Polite Language Rewrite.” arXiv. https://doi.org/10.48550/ARXIV.2212.10190.
Wang, Yau-Shian, and Yingshan Chang. 2022. “Toxicity Detection with Generative Prompt-Based Inference.” arXiv. https://doi.org/10.48550/ARXIV.2205.12390.
Wang, Yen-Jen, Bike Zhang, Jianyu Chen, and Koushil Sreenath. 2023. “Prompt a Robot to Walk with Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2309.09969.
Wang, Yue, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. 2023. “CodeT5+: Open Code Large Language Models for Code Understanding and Generation.” arXiv. https://doi.org/10.48550/ARXIV.2305.07922.
Wang, Zengzhi, Qiming Xie, Zixiang Ding, Yi Feng, and Rui Xia. 2023. “Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study.” arXiv. https://doi.org/10.48550/ARXIV.2304.04339.
Weizenbaum, Joseph. 1966. “ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine.” Communications of the ACM 9 (January). https://doi.org/10.1145/365153.365168.
Welbl, Johannes, Pontus Stenetorp, and Sebastian Riedel. 2017. “Constructing Datasets for Multi-Hop Reading Comprehension Across Documents.” arXiv. https://doi.org/10.48550/ARXIV.1710.06481.
Wen, Yuxin, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. “Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery.” arXiv. https://doi.org/10.48550/ARXIV.2302.03668.
Wessel, Martin, Tomás Horych, Terry Ruas, Akiko Aizawa, Bela Gipp, and Timo Spinde. 2023. “Introducing MBIB - the First Media Bias Identification Benchmark Task and Dataset Collection.” Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3539618.3591882.
West, Colin G. 2023. “AI and the FCI: Can ChatGPT Project an Understanding of Introductory Physics?” arXiv. https://doi.org/10.48550/ARXIV.2303.01067.
Westergaard, David, Hans-Henrik Stærfeldt, Christian Tønsberg, Lars Juhl Jensen, and Søren Brunak. 2017. “Text Mining of 15 Million Full-Text Scientific Articles,” July. https://doi.org/10.1101/162099.
White, Andrew D., Glen M. Hocky, Heta A. Gandhi, Mehrad Ansari, Sam Cox, Geemi P. Wellawatte, Subarna Sasmal, et al. 2023. “Assessment of Chemistry Knowledge in Large Language Models That Generate Code.” Digital Discovery 2. https://doi.org/10.1039/d2dd00087c.
Wichers, Nevan, Carson Denison, and Ahmad Beirami. 2024. “Gradient-Based Language Model Red Teaming.” arXiv. https://doi.org/10.48550/ARXIV.2401.16656.
Williams, Adina, Nikita Nangia, and Samuel R. Bowman. 2017. “A Broad-Coverage Challenge Corpus for Sentence Understanding Through Inference.” arXiv. https://doi.org/10.48550/ARXIV.1704.05426.
Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2009. “Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis.” Computational Linguistics 35 (September). https://doi.org/10.1162/coli.08-012-r1-06-90.
Witten, Ian H., Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. “KEA: Practical Automatic Keyphrase Extraction.” arXiv. https://doi.org/10.48550/ARXIV.CS/9902007.
Wolf, Yotam, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. 2023. “Fundamental Limitations of Alignment in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.11082.
Wu, Shijie, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. “BloombergGPT: A Large Language Model for Finance.” arXiv. https://doi.org/10.48550/ARXIV.2303.17564.
Wu, Zhaofeng, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2023. “Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks.” arXiv. https://doi.org/10.48550/ARXIV.2307.02477.
Xi, Zhiheng, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, et al. 2023. “The Rise and Potential of Large Language Model Based Agents: A Survey.” arXiv. https://doi.org/10.48550/ARXIV.2309.07864.
Xie, Sang Michael, Shibani Santurkar, Tengyu Ma, and Percy Liang. 2023. “Data Selection for Language Models via Importance Resampling.” arXiv. https://doi.org/10.48550/ARXIV.2302.03169.
Xie, Shangyu, Wei Dai, Esha Ghosh, Sambuddha Roy, Dan Schwartz, and Kim Laine. 2023. “Does Prompt-Tuning Language Model Ensure Privacy?” arXiv. https://doi.org/10.48550/ARXIV.2304.03472.
Xu, Fei, and Joshua B. Tenenbaum. 2007. “Word Learning as Bayesian Inference.” Psychological Review 114. https://doi.org/10.1037/0033-295x.114.2.245.
Xu, Weijia, Andrzej Banburski-Fahey, and Nebojsa Jojic. 2023. “Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling.” arXiv. https://doi.org/10.48550/ARXIV.2305.09993.
Xu, Xilie, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. 2023. “An LLM Can Fool Itself: A Prompt-Based Adversarial Attack.” arXiv. https://doi.org/10.48550/ARXIV.2310.13345.
Yamada, Masaru. 2023. “Optimizing Machine Translation Through Prompt Engineering: An Investigation into ChatGPT’s Customizability.” arXiv. https://doi.org/10.48550/ARXIV.2308.01391.
Yan, Xue, Yan Song, Xinyu Cui, Filippos Christianos, Haifeng Zhang, David Henry Mguni, and Jun Wang. 2023. “Ask More, Know Better: Reinforce-Learned Prompt Questions for Decision Making with Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2310.18127.
Yang, Jian, Xinyu Hu, Gang Xiao, and Yulong Shen. 2021. “A Survey of Knowledge Enhanced Pre-Trained Models.” arXiv. http://arxiv.org/abs/2110.00269v5.
Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond.” arXiv. https://doi.org/10.48550/ARXIV.2304.13712.
Yang, Kai-Cheng, and Filippo Menczer. 2023. “Large Language Models Can Rate News Outlet Credibility.” arXiv. https://doi.org/10.48550/ARXIV.2304.00228.
Yang, Linyao, Hongyang Chen, Zhao Li, Xiao Ding, and Xindong Wu. 2023. “Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-Aware Language Modeling.” arXiv. https://doi.org/10.48550/ARXIV.2306.11489.
Yang, Zetong, Li Jiang, Yanan Sun, Bernt Schiele, and Jiaya Jia. 2022. “A Unified Query-Based Paradigm for Point Cloud Understanding.” arXiv. https://doi.org/10.48550/ARXIV.2203.01252.
Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2305.10601.
Yao, Zonghai, Ahmed Jaafar, Beining Wang, Yue Zhu, Zhichao Yang, and Hong Yu. 2023. “Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation.” arXiv. https://doi.org/10.48550/ARXIV.2311.09684.
Ye, Qinyuan, Maxamed Axmed, Reid Pryzant, and Fereshte Khani. 2023. “Prompt Engineering a Prompt Engineer.” arXiv. https://doi.org/10.48550/ARXIV.2311.05661.
Ye, Yunhu, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. “Large Language Models Are Versatile Decomposers: Decompose Evidence and Questions for Table-Based Reasoning.” arXiv. https://doi.org/10.48550/ARXIV.2301.13808.
Yi, Jingwei, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2023. “Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2312.14197.
Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 2024. “Self-Rewarding Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2401.10020.
Yuan, Zheng, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. “RRHF: Rank Responses to Align Language Models with Human Feedback Without Tears.” arXiv. https://doi.org/10.48550/ARXIV.2304.05302.
Zack, Travis, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, et al. 2024. “Assessing the Potential of GPT-4 to Perpetuate Racial and Gender Biases in Health Care: A Model Evaluation Study.” The Lancet Digital Health 6 (January). https://doi.org/10.1016/s2589-7500(23)00225-x.
Zamfirescu-Pereira, J. D., Bjoern Hartmann, and Qian Yang. 2023. “Conversation Regression Testing: A Design Technique for Prototyping Generalizable Prompt Strategies for Pre-Trained Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.03154.
Zellers, Rowan, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. “SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference.” arXiv. https://doi.org/10.48550/ARXIV.1808.05326.
Zeng, Mingliang, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, and Tie-Yan Liu. 2021. “MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training.” arXiv. https://doi.org/10.48550/ARXIV.2106.05630.
Zhang, Chaoning, Chenshuang Zhang, Chenghao Li, Yu Qiao, Sheng Zheng, Sumit Kumar Dam, Mengchun Zhang, et al. 2023. “One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era.” arXiv. https://doi.org/10.48550/ARXIV.2304.06488.
Zhang, Jingqing, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. “PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization.” arXiv. https://doi.org/10.48550/ARXIV.1912.08777.
Zhang, Jizhi, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. “Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation.” Proceedings of the 17th ACM Conference on Recommender Systems, September. https://doi.org/10.1145/3604915.3608860.
Zhang, Peiyuan, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. “TinyLlama: An Open-Source Small Language Model.” arXiv. https://doi.org/10.48550/ARXIV.2401.02385.
Zhang, Wenjie, Xiaoning Song, Zhenhua Feng, Tianyang Xu, and Xiaojun Wu. 2023. “LabelPrompt: Effective Prompt-Based Learning for Relation Classification.” arXiv. https://doi.org/10.48550/ARXIV.2302.08068.
Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, et al. 2023. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2309.01219.
Zhang, Zhuosheng, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2023. “Multimodal Chain-of-Thought Reasoning in Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.00923.
Zheng, Chuanyang, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. “Progressive-Hint Prompting Improves Reasoning in Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.09797.
Zhong, Wanjun, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. “AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.” arXiv. https://doi.org/10.48550/ARXIV.2304.06364.
Zhou, Kaitlyn, Dan Jurafsky, and Tatsunori Hashimoto. 2023. “Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2302.13439.
Zhou, Yongchao, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. “Large Language Models Are Human-Level Prompt Engineers.” arXiv. https://doi.org/10.48550/ARXIV.2211.01910.
Zhu, Xuekai, Yao Fu, Bowen Zhou, and Zhouhan Lin. 2024. “Critical Data Size of Language Models from a Grokking Perspective.” arXiv. https://doi.org/10.48550/ARXIV.2401.10463.
Zhuo, Terry Yue, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. “Red Teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity.” arXiv. https://doi.org/10.48550/ARXIV.2301.12867.
Ziems, Caleb, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2023. “Can Large Language Models Transform Computational Social Science?” arXiv. https://doi.org/10.48550/ARXIV.2305.03514.
Zollo, Thomas P., Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, and Richard Zemel. 2023. “Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models.” arXiv. https://doi.org/10.48550/ARXIV.2311.13628.
Zuccon, Guido, and Bevan Koopman. 2023. “Dr ChatGPT, Tell Me What I Want to Hear: How Prompt Knowledge Impacts Health Answer Correctness.” arXiv. https://doi.org/10.48550/ARXIV.2302.13793.